Gene methylation and expression

ABSTRACT

The invention provides a method of analyzing the methylation status of all or part of an entire genome. Moreover, the invention features methods of and reagents for characterizing biological cells containing DNA that is susceptible to methylation. Such methods include methods of diagnosing cancer, e.g., breast cancer.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 60/685,104, filed May 27, 2005. The content of the prior application is incorporated herein by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

The research described in this application was supported in part by grants (Nos. CA89393 and CA94074) from the National Cancer Institute of the National Institutes of Health, and grants (Nos. DAMD 17-02-1-0692 and W8IXWH-04-1-0452) from the Department of Defense. Thus the government has certain rights in the invention.

TECHNICAL FIELD

This invention relates to epigenetic gene regulation, and more particularly to DNA methylation and its effect on gene expression, and its use as a marker of a particular cell type and/or disease state.

BACKGROUND

Epigenetic changes (e.g., changes in the levels of DNA methylation), as well as genetic changes, can be detected in cancer cells and stromal cells within tumors. In order to develop more discriminatory diagnostic methods and more effective therapeutic methods, it is important that these epigenetic effects be defined and characterized.

SUMMARY

The inventors have developed a method of assessing the level of methylation in an entire, or part of a, genome. They call this method Methylation Specific Digital Karyotyping (MSDK). The MSDK method can be adapted to establish a test genomic methylation profile for a test cell of interest. By comparing the test profile to control profiles obtained with defined cell types, the test cell can be identified. The MSDK method can also be used to identify genes in a test cell (e.g., a cancer cell) the methylation of which is altered (increased or decreased) relative to a corresponding control cell (e.g., a normal cell of the same tissue as the cancer cell). This information provides the basis for methods for discriminating whether a test cell of interest (a) is the same as a control cell (e.g., a normal cell) or (b) is different from a control cell but is, for example, a pathologic cell such as a cancer cell. Such methods include, for example, assessing the level of DNA methylation or the level of expression of genes of interest, or the level of DNA methylation in a particular chromosomal area in test cells and comparing the results to those obtained with control cells.

More specifically, the invention features a method of making a methylation specific digital karyotyping (MSDK) library. The method includes:

providing all or part of the genomic DNA of a test cell; exposing the DNA to a methylation-sensitive mapping restriction enzyme (MMRE) to generate a plurality of first fragments;

conjugating to one terminus or to both termini of each of the first fragments a binding moiety, the binding moiety comprising a first member of an affinity pair, the conjugating resulting in a plurality of second fragments;

exposing the plurality of second fragments to a fragmenting restriction enzyme (FRE) to generate a plurality of third fragments, each third fragment containing at one terminus the first member of the affinity pair and at the other terminus the 5′ cut sequence of the FRE or the 3′ cut sequence of the FRE;

contacting the plurality of third fragments with an insoluble substrate having bound thereto a plurality of second members of the affinity pair, the contacting resulting in a plurality of bound third fragments, each bound third fragment being a third fragment bound via the first and second members of the affinity pair to the insoluble substrate;

conjugating to free termini of the bound third fragments a releasing moiety, the releasing moiety comprising a releasing restriction enzyme (RRE) recognition sequence and, 3′ of the recognition sequence of the RRE, either the 5′ cut sequence of the FRE or the 3′ cut sequence of the FRE, the conjugating resulting in a plurality of bound fourth fragments, each bound fourth fragment (i) containing at one terminus the recognition sequence of the RRE and (ii) being bound via the first member of the affinity pair at the other terminus and the second member of the affinity pair to the insoluble substrate; and

exposing the bound fourth fragments to the RRE, the exposing resulting in the release from the insoluble substrate of a MSDK library, the library comprising a plurality of fifth fragments, each fifth fragment comprising the releasing moiety and a MSDK tag, the tag consisting of a plurality of base pairs of the genomic DNA. Thus, the method results in the production of a plurality of MSDK tags.

In the method, the MMRE can be, e.g., AscI, the FRE can be, e.g., NlaIII, and the RRE can be, e.g., MmeI. The binding moiety can further include a 5′ or 3′ cut sequence of the MMRE. The binding moiety can also further include, between the 5′ or 3′ recognition sequence of the MMRE and the first member of an affinity pair, a linker nucleic acid sequence comprising a plurality of base pairs. The releasing moiety can further include, 5′ of the RRE recognition sequence, an extender nucleic acid sequence comprising a plurality of base pairs. The test cell can be a vertebrate cell and the vertebrate test cell can be a mammalian test cell, e.g., a human test cell. Moreover, the test cell can be a normal cell or, for example, a cancer cell, e.g., a breast cancer cell. The first member of the affinity pair can be biotin, iminobiotin, avidin or a functional fragment of avidin, an antigen, a haptenic determinant, a single-stranded nucleotide sequence, a hormone, a ligand for an adhesion receptor, a receptor for an adhesion ligand, a ligand for a lectin, a lectin, a molecule containing all or part of an immunoglobulin Fc region, bacterial protein A, or bacterial protein G. The insoluble substrate can include, or be, magnetic beads.

Also provided by the invention is a method of analyzing a MSDK library. The method includes: providing a MSDK library made by the above-described method; and identifying the nucleotide sequences of one tag, a plurality of tags, or all of the tags. Identifying the nucleotide sequences of a plurality of tags can involve: making a plurality of ditags, each ditag containing two fifth fragments ligated together; forming a concatamer containing a plurality of ditags or ditag fragments, wherein each ditag fragment contains two MSDK tags; determining the nucleotide sequence of the concatamer; and deducing, from the nucleotide sequence of the concatamer, the nucleotide sequences of one or more of the MSDK tags that the concatamer contains. The ditag fragments can be made by exposing the ditags to the FRE. The method can further include, after making a plurality of ditags and prior to forming the concatamers, increasing the number (abundance) of individual ditags by PCR. The method can further include determining the relative frequency of some or all of the tags.
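The deduction of tag sequences from a concatamer, and the subsequent tally of relative tag frequencies, can be illustrated with a minimal Python sketch. It is not the procedure of the Examples. It assumes that ditag fragments in the concatamer are separated by the NlaIII recognition sequence (CATG), that each ditag fragment carries one tag at each end with the second tag read as a reverse complement, and that tags are 17 bases long. The tag length and all function names are assumptions made for illustration only. A tag that itself contains CATG would be split incorrectly by this simple approach.

```python
from collections import Counter

ANCHOR = "CATG"  # NlaIII recognition sequence (the example FRE named in the text)
TAG_LEN = 17     # assumed tag length; this value is not taken from the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tags_from_concatamer(concatamer, tag_len=TAG_LEN, anchor=ANCHOR):
    """Split a concatamer at the anchor and read one tag from each ditag end."""
    tags = []
    for ditag in concatamer.upper().split(anchor):
        if len(ditag) < 2 * tag_len:
            continue  # too short to hold two complete tags; skip it
        tags.append(ditag[:tag_len])                       # tag read left to right
        tags.append(reverse_complement(ditag[-tag_len:]))  # tag on the other strand
    return tags


def relative_frequencies(tags):
    """Return each distinct tag's share of all observed tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}
```

Calling relative_frequencies(tags_from_concatamer(sequence)) on a sequenced concatamer gives a tag-to-frequency mapping of the kind used as an MSDK profile in the classification method described below.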

Another aspect of the invention is an additional method of analyzing a MSDK library. The method includes: providing a MSDK library made by the above-described method; and identifying a chromosomal site corresponding to the sequence of a tag selected from the library. The method can further involve determining a chromosomal location, in the genome of the test cell, of an unmethylated full recognition sequence of the MMRE closest to the identified chromosomal site. These two steps can be repeated with a plurality of tags obtained from the library in order to determine the chromosomal location of a plurality of unmethylated recognition sequences of the MMRE. The identification of the chromosomal site and the determination of the chromosomal location can be performed by a process that includes comparing the nucleotide sequence of the selected tag to a virtual tag library generated using the nucleotide sequence of the genome or the part of a genome, the nucleotide sequence of the full recognition sequence of the MMRE, the nucleotide sequence of the full recognition sequence of the FRE, and the number of nucleotides separating the full recognition sequence of the RRE from the RRE cutting site.
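One way such a virtual tag library might be generated is sketched below in Python. The sketch makes several assumptions that are not taken from the text: the MMRE is AscI (GGCGCGCC) and the FRE is NlaIII (CATG); the tag is the 17 bases adjacent to the nearest FRE site on either side of each MMRE site, read from the FRE site toward the MMRE site; and the input is a single linear sequence. The tag length stands in for the distance between the RRE recognition sequence and its cutting site. Function names are hypothetical.

```python
MMRE_SITE = "GGCGCGCC"  # AscI recognition sequence
FRE_SITE = "CATG"       # NlaIII recognition sequence
TAG_LEN = 17            # assumed; stands in for the RRE cutting distance

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(seq, motif):
    """Return every 0-based start position of motif in seq (overlaps allowed)."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def virtual_tags(seq, mmre=MMRE_SITE, fre=FRE_SITE, tag_len=TAG_LEN):
    """Map each predicted tag to the (MMRE position, side) pairs that yield it."""
    seq = seq.upper()
    fre_sites = find_all(seq, fre)
    library = {}
    for site in find_all(seq, mmre):
        site_end = site + len(mmre)
        upstream = [p for p in fre_sites if p + len(fre) <= site]
        downstream = [p for p in fre_sites if p >= site_end]
        if upstream:
            start = upstream[-1] + len(fre)  # first base after the nearest FRE site
            if start + tag_len <= site_end:  # tag must lie within the fragment
                tag = seq[start:start + tag_len]
                library.setdefault(tag, []).append((site, "upstream"))
        if downstream:
            end = downstream[0]              # first base of the nearest FRE site
            if end - tag_len >= site:
                tag = reverse_complement(seq[end - tag_len:end])
                library.setdefault(tag, []).append((site, "downstream"))
    return library
```

An observed tag can then be looked up in the returned dictionary to recover the position of the MMRE site that produced it. A tag that maps to more than one site is ambiguous and would need separate handling.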

In another aspect, the invention provides a method of classifying a biological cell. The method includes: (a) identifying the nucleotide sequences of one tag, a plurality of tags, or all of the tags in an MSDK library made as described above and determining the relative frequency of some or all of the tags, thereby obtaining a test MSDK profile for the test cell; (b) comparing the test MSDK profile to separate control MSDK profiles for one or more control cell types; (c) selecting a control MSDK profile that most closely resembles the test MSDK profile; and (d) assigning to the test cell a cell type that matches the cell type of the control MSDK profile selected in step (c). The test and control cells can be vertebrate cells, e.g., mammalian cells such as human cells. The control cell types can include a control normal cell and a control cancer cell of the same tissue as the normal cell. The control normal cell and the control cancer cell can be breast cells or of a tissue selected from colon, lung, prostate, and pancreas. The test cell can be a breast cell or of a tissue selected from colon, lung, prostate, and pancreas. The control cell types can include cells of different categories of a cancer of a single tissue and the different categories of a cancer of a single tissue can include, for example, a breast ductal carcinoma in situ (DCIS) cell and an invasive breast cancer cell. The different categories of a cancer of a single tissue can alternatively include, for example, two or more of: a high grade DCIS cell; an intermediate grade DCIS cell; and a low grade DCIS cell. The control cell types can include two or more of: a lung cancer cell; a breast cancer cell; a colon cancer cell; a prostate cancer cell; and a pancreatic cancer cell. In addition, the control cell types can include an epithelial cell obtained from non-cancerous tissue and a myoepithelial cell obtained from non-cancerous tissue. Furthermore, the control cells can also include stem cells and differentiated cells derived therefrom (e.g., epithelial cells or myoepithelial cells) of the same tissue type. The control stem cells and differentiated cells derived therefrom can be of breast tissue, or of a tissue selected from colon, lung, prostate, and pancreas. The control stem cells and differentiated cells derived therefrom can be normal or cancer cells (e.g., breast cancer cells) or obtained from a cancerous tissue (e.g., breast cancer).
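Steps (b) through (d) of the classification method can be illustrated with a short Python sketch. The text does not specify how resemblance between profiles is measured; the Euclidean distance used here is an assumption chosen for simplicity, and other measures (e.g., a correlation coefficient) could be substituted. Profiles are represented as dictionaries mapping a tag to its relative frequency, and the profile labels and function names are hypothetical.

```python
import math


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two tag-frequency profiles.

    A tag missing from a profile is treated as having frequency 0.
    """
    tags = set(profile_a) | set(profile_b)
    return math.sqrt(
        sum((profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) ** 2 for t in tags)
    )


def classify(test_profile, control_profiles):
    """Return the label of the control profile closest to the test profile."""
    if not control_profiles:
        raise ValueError("at least one control profile is required")
    return min(
        control_profiles,
        key=lambda label: profile_distance(test_profile, control_profiles[label]),
    )


# Toy data for illustration only; the tags and frequencies are made up.
controls = {
    "normal": {"TAG1": 0.5, "TAG2": 0.5},
    "cancer": {"TAG1": 0.9, "TAG3": 0.1},
}
assigned = classify({"TAG1": 0.85, "TAG3": 0.15}, controls)
```

In practice the choice of distance measure, and whether tag frequencies are normalized or log-transformed first, would affect which control profile is selected.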

Another embodiment of the invention is a method of diagnosis. The method includes: (a) providing a test breast epithelial cell; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA sequence (e.g., the gene) is selected from the AscI sites identified by the MSDK tags listed in Table 5, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control epithelial cell obtained from non-cancerous breast tissue, wherein an altered degree of methylation of the one or more C residues in the test epithelial cell compared to the control epithelial cell is an indication that the test epithelial cell is a cancer cell. The altered degree of methylation can be a lower degree of methylation or a higher degree of methylation. The altered degree of methylation can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or a region outside of the gene (e.g., in an intergenic region). The gene can be, for example, PRDM14 or ZCCHC14.

The invention provides another method of diagnosis. The method includes:

(a) providing a test colon epithelial cell; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA sequence (e.g., the gene) is selected from those identified by the MSDK tags listed in Table 2, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control epithelial cell obtained from non-cancerous colon tissue, wherein an altered degree of methylation of the one or more C residues in the test epithelial cell compared to the control epithelial cell is an indication that the test epithelial cell is a cancer cell. The altered degree of methylation can be a lower degree of methylation or a higher degree of methylation. In addition, the altered degree of methylation can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or a region outside of the gene (e.g., an intergenic region). The gene can be, for example, LHX3, TCF7L1, or LMX-1A.

Another method of diagnosis featured by the invention involves: (a) providing a test myoepithelial cell obtained from a test breast tissue; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA sequence (e.g., the gene) is selected from those identified by the MSDK tags listed in Table 10, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control myoepithelial cell obtained from non-cancerous breast tissue, wherein an altered degree of methylation of the one or more C residues in the test myoepithelial cell compared to the control myoepithelial cell is an indication that the test breast tissue is cancerous tissue. The altered degree of methylation can be a lower degree of methylation or a higher degree of methylation. In addition, the altered degree of methylation can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or a region outside of the gene (e.g., an intergenic region). The gene can be, for example, HOXD4, SLC9A3R1, or CDC42EP5.

Yet another method of diagnosis embodied by the invention involves:

(a) providing a test fibroblast obtained from a test breast tissue; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA sequence (e.g., the gene) is selected from those identified by the MSDK tags listed in Tables 7 and 8, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control fibroblast obtained from non-cancerous breast tissue, wherein an altered degree of methylation of the one or more C residues in the test fibroblast compared to the control fibroblast is an indication that the test breast tissue is cancerous tissue. The altered degree of methylation can be a lower degree of methylation or a higher degree of methylation. In addition, the altered degree of methylation can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or a region outside of the gene (e.g., an intergenic region). The gene can be, for example, Cxorf12.

In another aspect, the invention includes a method of determining the likelihood of a cell being an epithelial cell or a myoepithelial cell. The method involves:

(a) providing a test cell; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA sequence (e.g., the gene) is selected from those identified by the MSDK tags listed in Table 12, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control myoepithelial cell and to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control epithelial cell, wherein the test cell is: (i) more likely to be a myoepithelial cell if the degree of methylation in the test sample more closely resembles the degree of methylation in the control myoepithelial cell; or (ii) more likely to be an epithelial cell if the degree of methylation in the test sample more closely resembles the degree of methylation in the control epithelial cell. The C residues can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or in a region outside of the gene (e.g., an intergenic region). The gene can be, for example, LOC389333 or CDC42EP5.

In another aspect, the invention includes a method of determining the likelihood of a cell being a stem cell, a differentiated luminal epithelial cell, or a myoepithelial cell. The method involves: (a) providing a test cell; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA sequence (e.g., the gene) is selected from those identified by the MSDK tags listed in Table 15 or 16, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control stem cell, to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control differentiated luminal epithelial cell, and to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control myoepithelial cell, wherein the test cell is: (i) more likely to be a stem cell if the degree of methylation in the test sample more closely resembles the degree of methylation in the control stem cell; (ii) more likely to be a differentiated luminal epithelial cell if the degree of methylation in the test sample more closely resembles the degree of methylation in the control differentiated luminal epithelial cell; or (iii) more likely to be a myoepithelial cell if the degree of methylation in the test sample more closely resembles the degree of methylation in the control myoepithelial cell. The C residues can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or in a region outside of the gene (e.g., an intergenic region). The gene can be, for example, SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, or HOXA10.

The invention also features a method of diagnosis that involves: (a) providing a test cell from a test tissue; (b) determining the degree of methylation of one or more C residues in a PRDM14 gene in the test cell, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in the PRDM14 gene in a control cell obtained from non-cancerous tissue of the same tissue as the test cell, wherein an altered degree of methylation of the one or more C residues in the test cell compared to the control cell is an indication that the test cell is a cancer cell. The altered degree of methylation can be a lower degree of methylation or a higher degree of methylation. In addition, the altered degree of methylation can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or a region outside of the gene (e.g., an intergenic region). The test and control cells can be breast cells or of a tissue selected from colon, lung, prostate, and pancreas.

Another embodiment of the invention is a method of diagnosis that includes: (a) providing a test sample of breast tissue comprising a test epithelial cell; (b) determining the level of expression in the test epithelial cell of a gene selected from those listed in Table 5, wherein the gene is one that is expressed in a breast cancer epithelial cell at a substantially altered level compared to a normal breast epithelial cell; and (c) classifying the test cell as: (i) a normal breast epithelial cell if the level of expression of the gene in the test cell is not substantially altered compared to a control level of expression for a normal breast epithelial cell; or (ii) a breast cancer epithelial cell if the level of expression of the gene in the test cell is substantially altered compared to a control level of expression for a normal breast epithelial cell. The gene can be, for example, PRDM14 or ZCCHC14. The alteration in the level of expression can be an increase in the level of expression or a decrease in the level of expression.

Another aspect of the invention is a method of diagnosis that includes:

(a) providing a test sample of colon tissue comprising a test epithelial cell; (b) determining the level of expression in the test epithelial cell of a gene selected from those listed in Table 2, wherein the gene is one that is expressed in a colon cancer epithelial cell at a substantially altered level compared to a normal colon epithelial cell; and (c) classifying the test cell as: (i) a normal colon epithelial cell if the level of expression of the gene in the test cell is not substantially altered compared to a control level of expression for a normal colon epithelial cell; or (ii) a colon cancer epithelial cell if the level of expression of the gene in the test cell is substantially altered compared to a control level of expression for a normal colon epithelial cell. The gene can be, for example, LHX3, TCF7L1, or LMX-1A. The alteration in the level of expression can be an increase in the level of expression or a decrease in the level of expression.

Another method of diagnosis included in the invention involves: (a) providing a test sample of breast tissue comprising a test stromal cell; (b) determining the level of expression in the stromal cell of a gene selected from those listed in Tables 7, 8, and 10, wherein the gene is one that is expressed in a cell of the same type as the test stromal cell at a substantially altered level when present in breast cancer tissue than when present in normal breast tissue; and (c) classifying the test sample as: (i) normal breast tissue if the level of expression of the gene in the test stromal cell is not substantially altered compared to a control level of expression for a control cell of the same type as the test stromal cell in normal breast tissue; or (ii) breast cancer tissue if the level of expression of the gene in the test stromal cell is substantially altered compared to a control level of expression for a control cell of the same type as the test stromal cell in normal breast tissue. The test and control stromal cells can be myoepithelial cells and the genes can be those listed in Table 10, e.g., HOXD4, SLC9A3R1, or CDC42EP5. Alternatively, the test and control stromal cells can be fibroblasts and the genes can be those listed in Tables 7 and 8, e.g., Cxorf12. The alteration in the level of expression can be an increase in the level of expression or a decrease in the level of expression.

In another aspect, the invention includes a method of determining the likelihood of a cell being an epithelial cell or a myoepithelial cell. The method includes: (a) providing a test cell; (b) determining the level of expression in the test sample of a gene selected from the group consisting of those identified by the MSDK tags listed in Table 12; (c) determining whether the level of expression of the selected gene in the test sample more closely resembles the level of expression of the selected gene in (i) a control myoepithelial cell or (ii) a control epithelial cell; and (d) classifying the test cell as: (i) likely to be a myoepithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control myoepithelial cell; or (ii) likely to be an epithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control epithelial cell. The gene can be, for example, LOC389333 or CDC42EP5.

In another aspect, the invention includes a method of determining the likelihood of a cell being a stem cell, a differentiated luminal epithelial cell, or a myoepithelial cell. The method includes: (a) providing a test cell; (b) determining the level of expression in the test sample of a gene selected from the group consisting of those identified by the MSDK tags listed in Table 15 or 16; (c) determining whether the level of expression of the selected gene in the test sample more closely resembles the level of expression of the selected gene in (i) a control stem cell, (ii) a control differentiated luminal epithelial cell, or (iii) a control myoepithelial cell; and (d) classifying the test cell as: (i) likely to be a stem cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control stem cell; (ii) likely to be a differentiated luminal epithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control differentiated luminal epithelial cell; or (iii) likely to be a myoepithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control myoepithelial cell. The gene can be, for example, SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, or HOXA10.

Also embodied by the invention is a method of diagnosis that includes:

(a) providing a test cell; (b) determining the level of expression in the test cell of a PRDM14 gene; and (c) classifying the test cell as: (i) a normal cell if the level of expression of the gene in the test cell is not substantially altered compared to a control level of expression for a control normal cell of the same tissue as the test cell; or (ii) a cancer cell if the level of expression of the gene in the test cell is substantially altered compared to a control level of expression for a control normal cell of the same tissue as the test cell. The alteration in the level of expression can be an increase in the level of expression or a decrease in the level of expression. The test and control cells can be breast cells or of a tissue selected from colon, lung, prostate, and pancreas.

The invention also provides a single stranded nucleic acid probe that includes: (a) the nucleotide sequence of a tag selected from those listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16; (b) the complement of the nucleotide sequence; or (c) the AscI sites defined by the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.

In another aspect, there is provided an array containing a substrate having at least 10, 25, 50, 100, 200, 500, or 1,000 addresses, wherein each address has disposed thereon a capture probe that includes: (a) a nucleic acid sequence consisting of a tag nucleotide sequence selected from those listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16; (b) the complement of the nucleic acid sequence; or (c) the AscI sites defined by the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.

The invention also features a kit comprising at least 10, 25, 50, 100, 200, 500, or 1,000 probes, each probe containing: (a) a nucleic acid sequence comprising a tag nucleotide sequence selected from those listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16; (b) the complement of the nucleic acid sequence; or (c) the AscI sites defined by the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.

Another aspect of the invention is a kit containing at least 10, 25, 50, 100, 200, 500, or 1,000 antibodies, each of which is specific for a different protein encoded by a gene identified by a tag selected from the group consisting of the tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.

As used herein, an “affinity pair” is any pair of molecules that have an intrinsic ability to bind to each other. Thus, affinity pairs include, without limitation, any receptor/ligand pair, e.g., vitamins (e.g., biotin)/vitamin-binding proteins (e.g., avidin or streptavidin); cytokines (e.g., interleukin-2)/cytokine receptors (e.g., interleukin-2 receptor); hormones (e.g., steroid hormones)/hormone receptors (e.g., steroid hormone receptors); signal transduction ligands/signal transduction receptors; adhesion ligands/adhesion receptors; death domain molecule-binding ligands/death domain molecules; lectins (e.g., pokeweed mitogen, pea lectin, concanavalin A, lentil lectin, phytohemagglutinin (PHA) from Phaseolus vulgaris, peanut agglutinin, soybean agglutinin, Ulex europaeus agglutinin-I, Dolichos biflorus agglutinin, Vicia villosa agglutinin, and Sophora japonica agglutinin)/lectin receptors (e.g., carbohydrate lectin receptors); antigens or haptens (e.g., trinitrophenol or biotin)/antibodies (e.g., antibody specific for trinitrophenol or biotin); immunoglobulin Fc fragments/immunoglobulin Fc fragment binding proteins (e.g., bacterial protein A or protein G). Ligands can serve as first or second members of an affinity pair, as can receptors. Where a ligand is used as the first member of the affinity pair, the corresponding receptor is used as the second member of the affinity pair, and where a receptor is used as the first member of the affinity pair, the corresponding ligand is used as the second member of the affinity pair. Functional fragments of polypeptide first and second members of affinity pairs are fragments of the full-length, mature first or second members that are shorter than the full-length, mature first or second members but have at least 25% (e.g., at least: 30%; 40%; 50%; 60%; 70%; 80%; 90%; 95%; 98%; 99%; 99.5%; 100%; or even more) of the ability of the full-length, mature first or second members to bind to corresponding second or first members, respectively.

The nucleotide sequences of all the identified genes in Tables 2, 5, 7, 8, 10, 12, 15, and 16 are available on public genetic databases (e.g., GenBank). These sequences are incorporated herein by reference.

As used herein, a “substantially altered” level of expression of a gene in a first cell (or first tissue) compared to a second cell (or second tissue) is an at least 2-fold (e.g., at least: 2-; 3-; 4-; 5-; 6-; 7-; 8-; 9-; 10-; 15-; 20-; 30-; 40-; 50-; 75-; 100-; 200-; 500-; 1,000-; 2,000-; 5,000-; or 10,000-fold) altered level of expression of the gene. It is understood that the alteration can be an increase or a decrease.
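The 2-fold criterion can be expressed as a simple symmetric test, sketched here in Python. The function name and the treatment of zero expression levels are assumptions made for illustration; the definition above does not address undetectable expression.

```python
def is_substantially_altered(level_a, level_b, min_fold=2.0):
    """Return True if two expression levels differ by at least min_fold.

    The test is symmetric, so an increase and a decrease are treated alike.
    Levels are non-negative measurements in the same arbitrary units.
    """
    if level_a < 0 or level_b < 0:
        raise ValueError("expression levels must be non-negative")
    low, high = sorted((level_a, level_b))
    if high == 0:
        return False  # neither cell expresses the gene: no alteration
    if low == 0:
        return True   # assumption: expression versus none counts as altered
    return high / low >= min_fold
```

Raising min_fold to any of the higher thresholds listed above (e.g., 10.0) makes the test correspondingly stricter.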

As used herein, breast “stromal cells” are breast cells other than epithelial cells.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. In case of conflict, the present document, including definitions, will control. Preferred methods and materials are described below, although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All publications, patent applications, patents and other references mentioned herein are incorporated by reference in their entirety. The materials, methods, and examples disclosed herein are illustrative only and not intended to be limiting.

Other features and advantages of the invention, e.g., assessing the methylation of an entire genome, will be apparent from the following description, from the drawings and from the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is a diagrammatic representation of the generation of a restriction enzyme 5′ cut sequence and 3′ cut sequence by the restriction enzyme cutting DNA at the restriction enzyme's recognition sequence. In the diagram are shown the two strands of a segment of double stranded DNA containing a restriction enzyme recognition sequence in which each of the nucleotides constituting the recognition sequence is shown as an N. The exemplary restriction enzyme recognition sequence in the diagram is a six base pair recognition sequence and cutting by the particular restriction enzyme results in a 3′ two nucleotide overhang. The N-containing sequences constituting the restriction enzyme recognition sequence and the restriction enzyme's 3′ and 5′ cut sequences are boxed and appropriately labeled. Those skilled in the art will appreciate that 5′ and 3′ termini generated by the multiple restriction enzymes available differ greatly (in nucleotide content, whether cohesive termini are generated, and, if they are, in the nature and number of nucleotides in the overhang). Nevertheless, in the sense that all termini (5′ and 3′ cut sequences) produced by the action of restriction enzymes that cut at their recognition sequences consist of nucleotides derived from the relevant restriction enzyme recognition sequence, 5′ and 3′ restriction enzyme cut sequences share qualitative features and differ only in how these nucleotides are distributed between the 5′ and 3′ cut sequences.

FIG. 2 is a schematic depiction of the MSDK procedure described in Examples 1 and 2.

FIGS. 3-5 are diagrammatic representations of the results of a methylation-detecting sequence analysis of segments of the LHX3 gene region (FIG. 3; SEQ ID NO:3), the LMX-1A gene region (FIG. 4; SEQ ID NO:5), and the TCF7L1 gene region (FIG. 5; SEQ ID NO:4) shown in FIGS. 6-8, respectively. The circles represent potential methylation sites (CpG) in the analyzed segment of SEQ ID NOs:3, 5, and 4. The order of circles (starting from the left of the rows of circles) is that of the CpG dinucleotides in the analyzed segments of SEQ ID NOs:3, 5 and 4 (starting from the 5′ end of the analyzed segment nucleotide sequences). The analyses were performed on DNA from wild-type HCT116 human colon cancer cells (“WT”) and HCT116 cells having both alleles of their DNMT1 and DNMT3b methyltransferase genes “knocked out” (“DKO”). Each circle is a pie chart with the amount of shading indicating the frequency (0%-100%) at which the relevant potential methylation site was found to be methylated. The top lines under the circles are linear depictions of the relevant gene transcripts and include the exons (shaded boxes) and introns (lines between the shaded boxes) and the bottom lines under the circles are linear depictions of the chromosomes on which the genes are located. On the chromosome depictions are shown the locations of the MSDK tag sequences that indicated the locations of the relevant AscI recognition sequences, which locations are also shown. The numbering on the bottom lines indicates the base pair (bp) numbers on the chromosomes and the numbering on the top lines indicates the bp numbers, in the chromosomes, of the transcription start sites and termination sites. The transcription initiation sites and the directions of transcription are also shown.
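By way of illustration only, the per-site frequencies represented by the pie charts can be sketched as follows. The input encoding (one string per sequenced clone, with "M" for a methylated CpG and "U" for an unmethylated CpG) and the function name are assumptions made for this sketch and are not taken from the Examples.

```python
def methylation_frequency(clones):
    """Return, for each CpG position, the percentage of clones methylated there.

    clones: list of equal-length strings (hypothetical encoding), one per
    sequenced clone, where 'M' marks a methylated CpG and 'U' an unmethylated one.
    """
    if not clones:
        raise ValueError("at least one clone is required")
    n_sites = len(clones[0])
    if any(len(c) != n_sites for c in clones):
        raise ValueError("all clones must cover the same CpG positions")
    n_clones = len(clones)
    return [
        100.0 * sum(clone[i] == "M" for clone in clones) / n_clones
        for i in range(n_sites)
    ]


# Four hypothetical clones covering three CpG sites (5' to 3').
example = ["MMU", "MUU", "MMU", "MUU"]
print(methylation_frequency(example))  # [100.0, 50.0, 0.0]
```

Each value in the returned list would correspond to the shaded fraction of one circle, read from the 5′ end of the analyzed segment.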

FIG. 6A is a depiction of the nucleotide sequence (SEQ ID NO:3) of a region of the LHX3 gene containing the MSDK tag sequence (bold and underlined) that identified the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:3 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp −196 to bp +172 (relative to the LHX3 gene transcription initiation site) and thus the last 23 CpG in the sequenced segment are within the promoter region and the first 26 CpG are in exon 1.

FIG. 6B is a depiction of the nucleotide sequence (SEQ ID NO:1545) of a region of the LHX3 gene within SEQ ID NO:3 containing the relevant AscI site (bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 7A is a depiction of the nucleotide sequence (SEQ ID NO:5) of a region of the LMX-1A gene containing the MSDK tag sequence (bold and underlined) that identified the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:5 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp −842 to bp −609 (relative to the LMX-1A gene transcription initiation site) and thus the whole of the sequenced segment is within the promoter region.

FIG. 7B is a depiction of the nucleotide sequence (SEQ ID NO:1546) of a region of the LMX-1A gene within SEQ ID NO:5 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 8A is a depiction of the nucleotide sequence (SEQ ID NO:4) of a region of the TCF7L1 gene containing the MSDK tag sequence (bold and underlined) that identified the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:4 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +782 to bp +1003 (relative to the TCF7L1 gene transcription initiation site) and thus the first six CpG in the sequenced segment are within exon 1 and the last 19 CpG are in intron 3-4.

FIG. 8B is a depiction of the nucleotide sequence (SEQ ID NO:1547) of a region of the TCF7L1 gene within SEQ ID NO:4 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIGS. 9-15 are diagrammatic representations of the results of a methylation-detecting sequence analysis of the segments of, respectively, the PRDM14 gene region (FIG. 9; SEQ ID NO:1), the ZCCHC14 gene region (FIG. 10; SEQ ID NO:2), the HOXD4 gene region (FIG. 11; SEQ ID NO:6), the SLC9A3R1 gene region (FIG. 12; SEQ ID NO:7), the LOC389333 gene region (FIG. 13; SEQ ID NO:10), the CDC42EP5 gene region (FIG. 14; SEQ ID NO:8), and the Cxorf12 gene region (FIG. 15; SEQ ID NO:9) shown in FIGS. 16A-22A, respectively. The circles represent potential methylation sites (CpG) in the analyzed segments. The order of circles (starting from the left of the rows of circles) is that of the CpG dinucleotides in the analyzed segments (starting from the 5′ end of the analyzed segment nucleotide sequences). The analyses were performed on DNA from the indicated cells obtained from the indicated samples (see Table 3). Samples used for the generation of MSDK libraries are marked with an asterisk. Each circle is a pie chart with the amount of shading indicating the frequency (0%-100%) at which the relevant potential methylation site was found to be methylated. The top (bold) lines under the circles are linear depictions of the relevant gene transcripts and include the exons (shaded boxes) and introns (lines between the shaded boxes) and the bottom lines under the circles are linear depictions of the chromosomes on which the genes are located. On the chromosome depictions are shown the locations of the MSDK tag sequences that indicated the location of the relevant AscI recognition sequences, which locations are also shown. The numbering on the bottom lines indicates the bp numbers for the chromosomes and the numbering on the top lines indicates the bp numbers, in the chromosomes, of the transcription start sites and termination sites. The transcription initiation sites and the directions of transcription are also shown.

FIG. 15 provides the above-listed information for the HCFC1 gene as well as the Cxorf12 gene. As can be seen from the figure, the two genes are located relatively close together on the X chromosome.

FIG. 16A is a depiction of the nucleotide sequence (SEQ ID NO:1) of a region of the PRDM14 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:1 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +666 to bp +839 (relative to the PRDM14 gene transcription initiation site) and thus the whole sequenced segment is within intron 1-2.

FIG. 16B is a depiction of the nucleotide sequence (SEQ ID NO:1548) of a region of the PRDM14 gene within SEQ ID NO:1 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 17A is a depiction of the nucleotide sequence (SEQ ID NO:2) of a region of the ZCCHC14 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:2 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +79 to bp +292 (relative to the ZCCHC14 gene transcription initiation site) and thus the last 14 CpG in the sequenced segment are within exon 1 and the first 7 CpG are in intron 1-2.

FIG. 17B is a depiction of the nucleotide sequence (SEQ ID NO:1549) of a region of the ZCCHC14 gene within SEQ ID NO:2 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 18A is a depiction of the nucleotide sequence (SEQ ID NO:6) of a region of the HOXD4 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:6 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +986 to bp +1,189 (relative to the HOXD4 gene transcription initiation site) and thus the whole sequenced segment is within intron 1-2.

FIG. 18B is a depiction of the nucleotide sequence (SEQ ID NO:1550) of a region of the HOXD4 gene within SEQ ID NO:6 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 19A is a depiction of the nucleotide sequence (SEQ ID NO:7) of a region of the SLC9A3R1 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:7 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +11,713 to bp +11,978 (relative to the SLC9A3R1 gene transcription initiation site) and thus the whole sequenced segment is within intron 1-2.

FIG. 19B is a depiction of the nucleotide sequence (SEQ ID NO:1551) of a region of the SLC9A3R1 gene within SEQ ID NO:7 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 20A is a depiction of the nucleotide sequence (SEQ ID NO:10) of a region of the LOC389333 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:10 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +518 to bp +762 (relative to the LOC389333 gene transcription initiation site) and thus the last 10 CpG in the sequenced segment are within exon 1 and the first 21 CpG are within intron 1-2.

FIG. 20B is a depiction of the nucleotide sequence (SEQ ID NO:1552) of a region of the LOC389333 gene within SEQ ID NO:10 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 21A is a depiction of the nucleotide sequence (SEQ ID NO:8) of a region of the CDC42EP5 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:8 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +7,991 to bp +8,193 (relative to the CDC42EP5 gene transcription initiation site) and thus the whole sequenced segment is within exon 3.

FIG. 21B is a depiction of the nucleotide sequence (SEQ ID NO:1553) of a region of the CDC42EP5 gene within SEQ ID NO:8 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 22A is a depiction of the nucleotide sequence (SEQ ID NO:9) of a region of the Cxorf12 gene containing the MSDK tag sequence (bold and underlined) that identified the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:9 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3′ end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3′ end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp −838 to bp −639 (relative to the Cxorf12 gene transcription initiation site) and thus the whole sequenced segment is within the promoter region.

FIG. 22B is a depiction of the nucleotide sequence (SEQ ID NO:1555) of a region of the Cxorf12 gene within SEQ ID NO:9 containing the MSDK tag sequence (bold and underlined) that identified the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded).

FIGS. 23A-F are a series of bar graphs showing the results of quantitative methylation specific PCR (qMSP) analyses of the PRDM14 (FIG. 23A), HOXD4 (FIG. 23B), SLC9A3R1 (FIG. 23C), CDC42EP5 (FIG. 23D), LOC389333 (FIG. 23E), and Cxorf12 (FIG. 23F) genes in epithelial cells (left set of normal and tumor cell bars), myoepithelial cells (middle set of normal and tumor cell bars), and fibroblast-enriched stromal cells (right set of normal and tumor cell bars) isolated from the indicated normal breast tissue and breast carcinoma samples. The average Ct value for each gene was normalized against the ACTB value (see Example 1). The data (“Relative methylation (%)”) are percentages relative to the ACTB value. Samples used for generation of MSDK libraries are indicated by asterisks. The PRDM14 gene is almost exclusively methylated in tumor epithelial cells and the LOC389333 gene is preferentially methylated in epithelial cells (both tumor and normal) compared to other cell types. The HOXD4, SLC9A3R1, and CDC42EP5 genes, besides being differentially methylated between normal and DCIS myoepithelial cells, are also methylated in other cell types. The HOXD4 gene is differentially methylated between normal and tumor epithelial cells and frequently methylated in stromal fibroblasts, while the SLC9A3R1 and CDC42EP5 genes are frequently methylated in stromal fibroblasts and occasionally in epithelial cells. The Cxorf12 gene is hypermethylated in tumor fibroblast-enriched stromal cells compared to normal cells of the same type and is also methylated in a fraction of epithelial cells.

FIG. 24 is a bar graph showing the results of qMSP analyses of the PRDM14 gene in a panel of normal breast tissues, benign breast tumors (fibroadenomas, papillomas, and fibrocystic disease), and breast carcinomas. The data were computed as described for FIG. 23. 500% was set as the upper limit of relative methylation although a few samples showed a difference above this threshold.

FIGS. 25A-D are a series of bar graphs showing the results of expression analyses of the PRDM14 (FIG. 25A), Cxorf12 (FIG. 25B), CDC42EP5 (FIG. 25C), and HOXD4 (FIG. 25D) genes in normal breast and breast carcinoma (tumor) epithelial cells, fibroblast-enriched stromal cells (stroma), and myoepithelial cells and in invasive breast carcinoma cell myofibroblasts. The average Ct value for each gene was normalized against the RPL39 value (see Example 1). The data (“Relative expression (%)”) are percentages relative to the RPL39 value. Using RPL19 and RPS13 values for normalization gave essentially the same results. The PRDM14 gene was relatively overexpressed in invasive breast carcinoma epithelial cells. The Cxorf12 gene was expressed at a relatively higher level in normal than in tumor fibroblast-enriched stromal cells. The CDC42EP5 and HOXD4 genes showed higher expression in DCIS myoepithelial cells and invasive breast carcinoma myofibroblasts compared to normal myoepithelial cells and also, in the case of the CDC42EP5 gene, to normal epithelial cells.

FIG. 26A is a schematic representation of the procedure used for tissue fractionation and purification of the various cell types from normal breast tissue. Cells were captured by antibody-coupled magnetic beads as indicated by the figure.

FIG. 26B is a series of photographs of ethidium bromide-stained electrophoretic gels of semi-quantitative RT-PCR analyses of selected genes from the purified cell fractions isolated from normal breast tissue. PPIA was used as a loading control. The triangles indicate an increasing number of PCR cycles (25, 30, and 35).

FIG. 26C is a series of graphs showing the ratio and location of statistically significant (p<0.05) tags, generated by MSDK, that are differentially methylated in different cell types isolated from normal mammary tissue. Dots corresponding to genes selected for further validation are circled. The X-axis represents the ratio of normalized tags from the indicated libraries in the various comparisons. CD44/All indicates the comparison of mammary stem cells (CD44+) against all differentiated cells (CD10+, CD24+, and MUC1+).

FIG. 27A is a series of diagrammatic representations of the results of a methylation-detecting sequence analysis of segments of the SLC9A3R1 gene region, the FNDC1 gene region, the FOXC1 gene region, the PACAP gene region, the DDN gene region, the CDC42EP5 gene region, the LHX1 gene region, the SOX13 gene region, and the DTX gene region. The circles represent potential methylation sites (CpG) in the analyzed segments of SEQ ID NOs:7, 8, and 11-18. The order of the circles (starting from the left of the rows of circles) is that of the CpG dinucleotides in the analyzed segments of SEQ ID NOs:7, 8, and 11-18 (starting from the 5′ end of the analyzed segment nucleotide sequences). The analyses were performed on DNA isolated from CD44+, CD24+, MUC1+, and CD10+ cell populations. Each circle is a pie chart with the amount of shading indicating the frequency (0-100%) at which the relevant potential methylation site was found to be methylated. The top lines under the circles are linear depictions of the relevant gene transcripts and include the exons (shaded boxes) and introns (lines between the shaded boxes) and the bottom lines under the circles are linear depictions of the chromosomes on which the genes are located. On the chromosome depictions are shown the locations of the MSDK tag sequences that indicated the locations of the relevant AscI recognition sequences, which locations are also shown. The numbering on the bottom lines indicates the base pair (bp) numbers on the chromosomes and the numbering on the top lines indicates the bp numbers, in the chromosomes, of the transcription start sites and termination sites. The transcription initiation sites and the directions of transcription are also shown.

FIG. 27B is a series of bar graphs showing the results of quantitative methylation specific PCR (qMSP) analyses of the SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10 genes in CD44+, CD10+, MUC1+, and CD24+ cell populations from women of different ages (18-58 years old) and reproductive history. The average Ct value for each gene was normalized against the ACTB value. The data (“Relative expression (%)”) are percentages relative to the RPL39 value.

FIG. 28 is a series of bar graphs showing the results of expression analyses of the SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10 genes in CD44+, CD10+, MUC1+, and CD24+ cells isolated from normal breast tissue. The average Ct value for each gene was normalized against the RPL39 value. The data (“Relative expression (%)”) are percentages relative to the RPL39 value.

FIGS. 29A-29B are a series of bar graphs depicting the results of quantitative methylation specific PCR (qMSP) analyses of DNA from (A) the SLC9A3R1, FNDC1, FOXC1, PACAP, LHX1, and HOXA10 genes in putative breast cancer stem cells (T-EPCR+) and cells with a more differentiated phenotype from the same tumor (T-CD24+), and (B) the HOXA10, FOXC1, PACAP, and LHX1 genes from matched primary tumors (indicated by a star) and distant metastases (DM) collected from different organs. The average Ct value for each gene was normalized against the RPL39 value (see Example 1). The data (“Relative expression (%)”) are percentages relative to the RPL39 value.

FIG. 30 is a depiction of the nucleotide sequence (SEQ ID NO:11) of a region of the FNDC1 gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp −285 to bp −614 (relative to the FNDC1 gene transcription initiation site) and thus the whole sequenced segment is within the promoter region.

FIG. 31 is a depiction of the nucleotide sequence (SEQ ID NO:12) of a region of the FOXC1 gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp 5250 to bp 4976 (relative to the FOXC1 gene transcription initiation site) and thus the whole sequenced segment is within the promoter region.

FIG. 32 is a depiction of the nucleotide sequence (SEQ ID NO:13) of a region of the PACAP gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp 4404 to bp 4736 (relative to the PACAP gene transcription initiation site) and thus the whole sequenced segment is within the promoter region.

FIG. 33 is a depiction of the nucleotide sequence (SEQ ID NO:14) of a region of the DDN gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp 2108 to bp 2290 (relative to the DDN gene transcription initiation site) and thus the whole sequenced segment is within exon 2.

FIG. 34 is a depiction of the nucleotide sequence (SEQ ID NO:15) of a region of the LHX1 gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp 3600 to bp 3810 (relative to the LHX1 gene transcription initiation site) and thus the whole sequenced segment is within introns 3-4.

FIG. 35 is a depiction of the nucleotide sequence (SEQ ID NO:16) of a region of the SOX13 gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp 669 to bp 374 (relative to the SOX13 gene transcription initiation site) and thus the whole sequenced segment is within the promoter area.

FIG. 36 is a depiction of the nucleotide sequence (SEQ ID NO:17) of a region of the DTX gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp 228 to bp 551 (relative to the DTX gene transcription initiation site) and thus the whole sequenced segment is within the promoter area.

FIG. 37 is a depiction of the nucleotide sequence (SEQ ID NO:18) of a region of the HOXA10 gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp 4270 to bp 4634 (relative to the HOXA10 gene transcription initiation site) and thus the whole sequenced segment is within the promoter area.

FIG. 38 is a depiction of the nucleotide sequence (SEQ ID NO:1543) of a region of the SLC9A3R1 gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp 11713 to bp 11978 (relative to the SLC9A3R1 gene transcription initiation site) and thus the whole sequenced segment is within introns 1-2.

FIG. 39 is a depiction of the nucleotide sequence (SEQ ID NO:1544) of a region of the CDC42EP5 gene containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded). The sequenced segment spans bp 7855 to bp 8058 (relative to the CDC42EP5 gene transcription initiation site) and thus the whole sequenced segment is within exon 3.

DETAILED DESCRIPTION

Various aspects of the invention are described below.

Methylation Specific Digital Karyotyping (MSDK)

MSDK is a method of assessing the relative level of methylation of an entire genome, or part of a genome, of a cell of interest. The cell can be any DNA-containing biological cell in which the DNA is subject to methylation, e.g., prokaryotic cells (e.g., bacteria) or eukaryotic cells (e.g., yeast cells, protozoan cells, invertebrate cells, or vertebrate (e.g., mammalian) cells).

Vertebrate cells can be from any vertebrate species, e.g., reptiles (e.g., snakes, alligators, and lizards), amphibians (e.g., frogs and toads), fish (e.g., salmon, sharks, or trout), birds (e.g., chickens, turkeys, eagles, or ostriches), or mammals. Mammals include, for example, humans, non-human primates (e.g., monkeys, baboons, or chimpanzees), horses, bovine animals (e.g., cows, oxen, or bulls), whales, dolphins, porpoises, pigs, sheep, goats, cats, dogs, rabbits, gerbils, guinea pigs, hamsters, rats, or mice. Vertebrate and mammalian cells can be any nucleated cell of interest, e.g., epithelial cells (e.g., keratinocytes), myoepithelial cells, endothelial cells, fibroblasts, melanocytes, hematological cells (e.g., macrophages, monocytes, granulocytes, T lymphocytes (e.g., CD4+ and CD8+ lymphocytes), B-lymphocytes, natural killer (NK) cells, interdigitating dendritic cells), nerve cells (e.g., neurons, Schwann cells, glial cells, astrocytes, or oligodendrocytes), muscle cells (smooth and striated muscle cells), chondrocytes, or osteocytes. Also of interest are stem cells, progenitor cells, and precursor cells of any of the above-listed cells. Moreover, the method can be applied to malignant forms of any of the cells listed herein.

The cells can be of any tissue or organ, e.g., skin, eye, peripheral nervous system (PNS; e.g., vagal nerve), central nervous system (CNS; e.g., brain or spinal cord), skeletal muscle, heart, arteries, veins, lymphatic vessels, breast, lung, spleen, liver, pancreas, lymph node, bone, cartilage, joints, tendons, ligaments, gastrointestinal tissue (e.g., mouth, esophagus, stomach, small intestine, large intestine (e.g., colon or rectum)), genitourinary system (e.g., kidney, bladder, uterus, vagina, ovary, ureter, urethra, prostate, penis, testis, or scrotum). Cancer cells can be of any of these organs and tissues and include, without limitation, breast cancers (any of the types and grades recited herein), colon cancer, prostate cancer, lung cancer, pancreatic cancer, and melanoma.

MSDK can be performed on an entire genome of a cell, e.g., whole DNA extracted from an entire cell or the nucleus of a cell. Alternatively, it can be carried out on part of a genome, e.g., by extracting DNA from mutant cells lacking part of a genome, chromosome microdissection, or subtractive/differential hybridization. The method is performed on double-stranded DNA and, unless otherwise stated, in describing MSDK, the term “DNA” refers to double-stranded DNA.

Method of Making a MSDK Library

In the first step of the MSDK, genomic DNA is exposed to a methylation-sensitive mapping restriction enzyme (MMRE) that cuts the DNA at sites having the recognition sequence for the relevant MMRE. The MMRE can be any MMRE. In eukaryotic cells, methylation generally occurs at C nucleotides in CpG dinucleotide sequences in DNA. The term “CpG” refers to dinucleotide sequences that occur in DNA and consist of a C nucleotide and G nucleotide immediately 3′ of the C nucleotide. The “p” in “CpG” denotes the phosphate group that occurs between the C and G nucleoside residues in the CpG dinucleotide sequence.

The MMRE recognition sequence can contain one, two, three, or four C residues that are susceptible to methylation. If one (or more) of the C residues in a MMRE recognition sequence is methylated, the MMRE does not cut the DNA at the relevant MMRE recognition sequence. Examples of useful MMRE include, without limitation, AscI, AatII, AciI, AfeI, AgeI, AsiSI, AvaI, BceAI, BssHI, ClaI, EagI, Hpy99I, MluI, NarI, NotI, SacII, or ZraAI. The AscI recognition sequence is GGCGCGCC and thus contains two methylation sites (CpG sequences). If either one or both is methylated, the recognition site is not cut by AscI. There are approximately 5,000 AscI recognition sites per human genome.
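By way of illustration only, the CpG content of a recognition sequence and the cut/no-cut rule just described can be sketched as follows. The function names are hypothetical, and the model is deliberately simplified: a site is treated as cut only when none of its CpG sequences is methylated.

```python
ASCI_SITE = "GGCGCGCC"  # AscI recognition sequence as given in the text


def cpg_positions(seq):
    """Return the 0-based index of each C that is immediately followed by a G."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def is_cut(cpg_methylated):
    """Simplified MMRE rule: the site is cut only if no CpG in it is methylated.

    cpg_methylated: one boolean per CpG in the recognition sequence.
    """
    return not any(cpg_methylated)


print(cpg_positions(ASCI_SITE))  # [2, 4]
print(is_cut([False, False]))    # True
print(is_cut([True, False]))     # False
```

The two indices returned for GGCGCGCC correspond to the two methylation sites mentioned above.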

Exposure of the genomic DNA to the MMRE results in a plurality of first fragments, the absolute number of which will depend on the relative number of MMRE recognition sites that are methylated. The more that are methylated, the fewer first fragments will result. Most of the first fragments will have at one terminus the MMRE 5′ cut sequence (see definition below) and at the other terminus the MMRE 3′ cut sequence (see definition below). For each chromosome, two fragments with MMRE cut sequences at only one terminus will be generated; these first fragments are referred to herein as terminal first fragments. One such terminal first fragment contains the 5′ terminus of the chromosome at one end and a MMRE 3′ cut sequence at the other end and the other terminal fragment contains the 3′ terminus of the chromosome at one end and a MMRE 5′ cut sequence at the other end.
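By way of illustration only, the relationship between site methylation and the number of first fragments can be sketched for a single linear chromosome. The coordinates, the function name, and the "internal"/"terminal"/"uncut" labels are assumptions made for this sketch; each site is reduced to a single cut coordinate strictly inside the chromosome.

```python
def first_fragments(chrom_length, sites):
    """Model MMRE digestion of one linear chromosome.

    sites: list of (position, methylated) tuples, with 0 < position < chrom_length.
    Only unmethylated sites are cut. Returns a list of (start, end, kind) tuples,
    where kind is 'terminal' for a fragment carrying a chromosome end, 'internal'
    for a fragment with a cut at both ends, and 'uncut' if no site was cut.
    """
    cuts = sorted(pos for pos, methylated in sites if not methylated)
    if not cuts:
        return [(0, chrom_length, "uncut")]
    bounds = [0] + cuts + [chrom_length]
    fragments = []
    for i in range(len(bounds) - 1):
        is_terminal = i == 0 or i == len(bounds) - 2
        kind = "terminal" if is_terminal else "internal"
        fragments.append((bounds[i], bounds[i + 1], kind))
    return fragments


# Three hypothetical sites; the middle one is methylated and so is not cut.
partly_methylated = first_fragments(1000, [(200, False), (500, True), (800, False)])
unmethylated = first_fragments(1000, [(200, False), (500, False), (800, False)])
print(len(partly_methylated), len(unmethylated))  # 3 4
```

In this toy case, methylating one of the three sites reduces the number of first fragments from four to three, and two terminal first fragments are produced in either case.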

As used herein, a “5′ cut sequence” of a restriction enzyme that cuts DNA within the restriction enzyme's recognition sequence is the portion of the restriction enzyme's recognition sequence at the 5′ end of a fragment containing the 3′ end of the restriction enzyme recognition sequence that is generated by cutting of DNA by the restriction enzyme. As used herein, a “3′ cut sequence” of a restriction enzyme that cuts DNA within the restriction enzyme's recognition sequence is the portion of the restriction enzyme's recognition sequence at the 3′ end of a fragment containing the 5′ end of the restriction enzyme recognition sequence that is generated by cutting of DNA by the restriction enzyme. 5′ and 3′ restriction enzyme cut sequences are illustrated in FIG. 1.
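By way of illustration only, the two definitions can be sketched for a single strand read 5′ to 3′. The function name and the cut offset are assumptions made for this sketch: the offset of 2 used for AscI (cleavage between GG and CGCGCC on the strand shown) is taken from commonly cited enzyme data rather than from this document, and the complementary strand and any overhang are ignored.

```python
def cut_sequences(recognition_seq, cut_offset):
    """Split a recognition sequence (one strand, 5' to 3') at the cut position.

    cut_offset: number of recognition-sequence nucleotides lying 5' of the cut
    on this strand (an assumed, enzyme-specific value).
    Returns (five_prime_cut, three_prime_cut):
      five_prime_cut  - begins the fragment that keeps the 3' end of the site
      three_prime_cut - ends the fragment that keeps the 5' end of the site
    """
    if not 0 < cut_offset < len(recognition_seq):
        raise ValueError("the cut must fall within the recognition sequence")
    three_prime_cut = recognition_seq[:cut_offset]
    five_prime_cut = recognition_seq[cut_offset:]
    return five_prime_cut, three_prime_cut


five, three = cut_sequences("GGCGCGCC", 2)
print(five, three)  # CGCGCC GG
```

Joining the 3′ cut sequence to the 5′ cut sequence regenerates the full recognition sequence, which reflects the point made for FIG. 1 that the two cut sequences differ only in how the recognition-sequence nucleotides are distributed between them.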

To the termini of the first fragments are conjugated a first member ofan affinity pair (see definition in Summary section), e.g., biotin oriminobiotin. This can be achieved by, for example, ligating to the MMRE5′ and 3′ cut sequence-containing termini a binding moiety. The bindingmoiety contains the first member of the affinity pair conjugated (e.g.,by a covalent bond or any other stable chemical linkage, e.g., acoordination bond, that can withstand the relatively mild chemicalconditions of the MSDK methodology) to either a MMRE 5′ cut sequence ora MMRE 3′ cut sequence. The majority of the fragments (referred toherein as second fragments) resulting from attachment by this method ofthe first members of the affinity pair will have first members of anaffinity pair bound to both their termini. Second fragments resultingfrom terminal first fragments will of course have first members of theaffinity pair only at one terminus, i.e., the terminus containing theMMRE cut sequence.

The binding moiety can, optionally, also contain a linker (or spacer) nucleotide sequence of any convenient length, e.g., one to 100 base pairs (bp), three to 80 bp, five to 70 bp, seven to 60 bp, nine to 50 bp, or 10 to 40 bp. The linker (or spacer) can be, for example, 30, 31, 32, 33, 34, 35, 36, 37, 38, or 40 bp long. As will be apparent, the linker must not include the recognition sequence of the fragmenting restriction enzyme (see below).

Instead of using the above-described binding moiety to attach the first members of an affinity pair to the termini of first fragments, the attachment can be done by any of a variety of chemical means known in the art. In this case, the first member of an affinity pair can optionally contain a functional chemical group that facilitates binding of the first member of the affinity pair to the termini of the first fragments. It will be appreciated that by using this “chemical method”, it is possible to attach first members of an affinity pair to both ends of terminal first fragments. Naturally, using the chemical method it is also possible to include the above-described linker (or spacer) nucleotide sequences. Where a functional chemical group is attached to the first member of the affinity pair, the linker (or spacer) nucleotide sequence is located between the first member of the affinity pair and the functional chemical group.

The second fragments are then exposed to a fragmenting restriction enzyme (FRE). The FRE can be any restriction enzyme whose recognition sequence occurs relatively frequently in the genomic DNA of interest. Thus, restriction enzymes having four-nucleotide recognition sequences are particularly desirable as FRE. In addition, the FRE should not be sensitive to methylation, i.e., its recognition sequence, at least in eukaryotic DNA, should not contain a CpG dinucleotide sequence. Preferably, the FRE recognition sequence should occur at least 10 (e.g., at least: 20; 50; 100; 500; 1,000; 2,000; 5,000; 10,000; 25,000; 50,000; 100,000; 200,000; 500,000; 10⁶; or 10⁷) times more frequently in the genome than does the MMRE recognition sequence. Examples of useful FRE whose recognition sequences consist of four nucleotides include, without limitation, AluI, BfaI, CviAII, FatI, HpyCH4V, MseI, NlaIII, and Tsp509I. The recognition sequence for NlaIII is CATG. Exposure of the second fragments to the FRE results in a large number of fragments, the majority of which will have FRE cut sequences at both of their termini and relatively few of which will have a FRE cut sequence (5′ or 3′) at one end and the first member of the affinity pair (corresponding to a MMRE cut sequence) at the other end. The latter fragments are referred to herein as third fragments.
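Where the genomic sequence of interest is available, the two FRE selection criteria stated above (no CpG dinucleotide in the recognition sequence, and a recognition sequence at least ten times more frequent than that of the MMRE) can be checked in silico. The following is a minimal sketch only; the function names and the 8-bp MMRE site used in the usage note are assumptions made for illustration, and only one strand is scanned, which suffices for palindromic recognition sequences.

```python
def count_sites(genome: str, site: str) -> int:
    """Count occurrences (overlapping allowed) of a recognition sequence.

    Only one strand is scanned; this is sufficient for palindromic sites.
    """
    genome, site = genome.upper(), site.upper()
    return sum(genome.startswith(site, i)
               for i in range(len(genome) - len(site) + 1))


def fre_is_suitable(genome: str, fre_site: str, mmre_site: str,
                    min_ratio: int = 10) -> bool:
    """Return True if a candidate FRE site lacks CpG and is frequent enough.

    min_ratio is the minimum FRE:MMRE site-count ratio (10 in the text).
    """
    if "CG" in fre_site.upper():
        # A CpG in the site would make cutting potentially methylation-sensitive.
        return False
    mmre_count = count_sites(genome, mmre_site)
    fre_count = count_sites(genome, fre_site)
    return mmre_count > 0 and fre_count >= min_ratio * mmre_count
```

For example, `fre_is_suitable(sequence, "CATG", "GGCGCGCC")` would test the NlaIII site against a hypothetical 8-bp MMRE site in `sequence`.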

The third fragments are then exposed to a solid substrate having bound to it the second member of the affinity pair (e.g., avidin, streptavidin, or a functional fragment of either; see Summary section for examples of other useful second members) corresponding to the first member of the affinity pair in the third fragments. The third fragments bind, via the physical interaction between the first and second members of the affinity pair, to the solid substrate. The solid substrate can be any insoluble substance such as plastic (e.g., plastic microtiter well or petri plate bottoms), metal (e.g., magnetic metallic beads), agarose (e.g., agarose beads), or glass (e.g., glass beads or the bottom of a glass vessel such as a glass beaker, test tube, or flask) to which the third fragments can bind and thus be separated from fragments not containing the first member of the affinity pair.

Fragments not bound to the solid substrate are removed from the mixture and the solid substrate is optionally rinsed or washed free of any non-specifically bound material. The third fragments bound to the solid substrate are referred to as bound third fragments.

The terminus of the bound third fragment not bound to the solid substrate (referred to herein as the free terminus) is then conjugated to a releasing restriction enzyme (RRE) (also referred to herein sometimes as a tagging enzyme) recognition sequence. This can be achieved by, for example, ligating to the free termini (containing a FRE 5′ or 3′ cut sequence) releasing moieties containing the FRE 5′ or 3′ cut sequence and, 5′ of the cut sequence, the RRE recognition sequence. Restriction enzymes useful as RRE are those that cut DNA at specific distances (depending on the particular restriction enzyme) from the recognition sequence, e.g., without limitation, type IIs and type IIB restriction enzymes. An example of a useful RRE is MmeI, which has the following non-palindromic recognition sequence: 5′-TCCPuAC, 3′-AGGPyTG (Pu, purine; Py, pyrimidine) and cuts DNA after the twentieth nucleotide downstream of the TCCPuAC sequence [Boyd et al. (1986) Nucleic Acids Res. 14(13): 5255-5274]. Other useful type IIs restriction enzymes include, without limitation, BsmFI, FokI, and AlwI, and useful type IIB restriction enzymes include, without limitation, BsaXI, CspCI, AloI, PpiI, and others listed in Tengs et al. [(2004) Nucleic Acids Research 32(15):e21 (pages 1-9)], the disclosure of which is incorporated herein by reference in its entirety.

Releasing moieties can optionally contain, immediately 5′ of the RRE recognition sequence, additional nucleotides as an extending sequence. The extending sequence can be of any convenient length, e.g., one to 100 bp, three to 80 bp, five to 70 bp, seven to 60 bp, nine to 50 bp, or 10 to 40 bp. The extending sequence can be, for example, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, or 40 bp long.

Conjugating the RRE recognition sequence to the free termini of the bound third fragments results in bound fourth fragments that (a) have RRE recognition sequences at their free termini, and (b) are bound by the first and second members of the affinity pair to the solid substrate. The bound fourth fragments are then exposed to the RRE, which cuts the bound fourth fragments at a position that is characteristic of the relevant RRE. In the case of the MmeI RRE, the bound fourth fragment is cut on the downstream side of the twentieth nucleotide after the terminal C residue of the TCCPuAC recognition sequence. The exposure results in the release from the solid substrate of a library of fifth fragments. Each of the fifth fragments contains the RRE recognition sequence (and extending sequence if used) and a plurality of bp of the test genomic DNA, including the FRE recognition sequence closest to an unmethylated MMRE recognition sequence. The absolute number of these bp of the test genomic DNA in the fifth fragments will vary from one RRE to another and is, in the case of MmeI, 20 nucleotides. The sequence of genomic DNA in the fifth fragment (but without the FRE recognition sequence) is referred to herein as a MSDK tag. Since the MmeI and NlaIII recognition sequences overlap by one nucleotide, the tags generated using MmeI as the RRE and NlaIII as the FRE are 17 nucleotides long.
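The tag-length arithmetic in the preceding paragraph can be made explicit: of the 20 nucleotides reached by MmeI, three are the part of the four-nucleotide NlaIII site not shared with the MmeI site, leaving 17 nucleotides of tag. The following sketch illustrates this; the function names and the convention that the input sequence is read 5′ to 3′ starting at the FRE site of the free terminus are assumptions for illustration.

```python
def msdk_tag_length(rre_reach: int, fre_site_len: int, overlap: int) -> int:
    """Tag length = RRE reach minus the FRE nucleotides not shared with the RRE site."""
    return rre_reach - (fre_site_len - overlap)


def extract_tag(free_end: str, fre_site: str = "CATG", tag_len: int = 17) -> str:
    """Return the tag adjacent to the FRE site at the start of `free_end`.

    `free_end` is assumed to be read 5'->3' beginning with the FRE site.
    """
    free_end = free_end.upper()
    if not free_end.startswith(fre_site):
        raise ValueError("sequence does not start with the FRE site")
    tag = free_end[len(fre_site):len(fre_site) + tag_len]
    if len(tag) < tag_len:
        raise ValueError("sequence too short to yield a full-length tag")
    return tag
```

With MmeI and NlaIII, `msdk_tag_length(20, 4, 1)` gives the 17 nucleotides stated above.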

The greater the number of bp between the RRE recognition sequence and the cutting site of the RRE, the longer the MSDK tags will be. The longer the MSDK tags are, the lower the chances of redundancy due to a plurality of occurrences of the tag sequence in the genome of interest will be. In addition, it will be appreciated that the number of bp between FRE recognition sequences and corresponding MMRE recognition sequences in the genomic DNA of interest will optimally be greater than the number of bp between the RRE recognition sequence and the RRE cut site. However, problems arising due to this criterion not being met can be obviated by using the binding moiety method of attaching a first member of an affinity pair to first fragment termini and including in the binding moiety a linker (or spacer) nucleotide sequence of appropriate length (see above); the shorter the distance between any given FRE recognition sequence and a corresponding MMRE recognition sequence in a genome being analyzed, the longer the linker (or spacer) nucleotide sequence would need to be.

Methods of Using a MSDK Tag Library

MSDK libraries generated as described above can be used for a variety of purposes.

The first step in most of such methods would be to at least identify the nucleotide sequences of as many MSDK tags obtained in making a library as possible. There are many ways in which this could be done, which will be apparent to those skilled in the art. For example, array technology or the MPSS (massively parallel signature sequencing) method could be exploited for this purpose. Alternatively, the MSDK tag-containing fifth fragments (see above) can be cloned into sequencing vectors (e.g., plasmids) and sequenced using standard sequencing techniques, preferably automated sequencing techniques.

The inventors have used a technique for identifying MSDK tag sequences (see Example 1 below) adapted from the Serial Analysis of Gene Expression (SAGE) technique [Porter et al. (2001) Cancer Res. 61:5697-5702; Krop et al. (2001) Proc. Natl. Acad. Sci. U.S.A. 98:9796-9801; Lal et al. (1999) Cancer Res. 59:5403-5407; and Boon et al. (2002) Proc. Natl. Acad. Sci. U.S.A. 99:11287-11292]. This adapted technique involves:

(a) adding a DNA ligase enzyme to a library of fifth fragments and thereby ligating pairs of fifth fragments having cohesive RRE-derived ends together to form fifth fragment dimers (also referred to herein as “ditags”);

(b) increasing the numbers of individual ditags by PCR using primers whose sequences correspond to nucleotide sequences in extender sequences derived from a releasing moiety (see above);

(c) digesting the PCR-amplified ditags with the FRE used to generate the MSDK library and thereby generating digested ditags lacking the RRE site and extender sequences (if used);

(d) concatamerizing (polymerizing) the ditags using a ligase enzyme (e.g., T4 ligase) to create ditag multimers;

(e) cloning the ditag multimers into sequencing vectors and sequencing the inserts (e.g., by automatic sequencing methods); and

(f) deducing from the ditag multimer sequences the sequences of individual MSDK tags.
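Step (f) can be sketched computationally. The example below is illustrative only and rests on assumptions not stated in the text: ditags within a sequenced multimer are taken to be punctuated by the FRE site (CATG for NlaIII), the first tag of each ditag is read from its 5′ end, and the second tag is the reverse complement of its 3′ end. All function names are hypothetical, and a tag that itself contains the FRE site would defeat this simple splitting.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def split_ditags(multimer: str, fre_site: str = "CATG", tag_len: int = 17):
    """Split a multimer at FRE sites; keep pieces long enough to hold two tags."""
    pieces = multimer.upper().split(fre_site)
    return [p for p in pieces if len(p) >= 2 * tag_len]


def tags_from_ditag(ditag: str, tag_len: int = 17):
    """Return (tag read from the 5' end, tag read from the opposite strand)."""
    return ditag[:tag_len], revcomp(ditag[-tag_len:])


def tags_from_multimer(multimer: str, fre_site: str = "CATG", tag_len: int = 17):
    """Deduce the individual tag sequences contained in one multimer read."""
    tags = []
    for ditag in split_ditags(multimer, fre_site, tag_len):
        tags.extend(tags_from_ditag(ditag, tag_len))
    return tags
```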

One of skill in the art will naturally know of ways to modify and adapt the above tag identification procedure to his or her particular requirements. For example, one or more of the steps (e.g., step (b), the ditag amplification step, or step (c), the step that removes the RRE recognition site and any extender sequence used) could be omitted.

Having obtained the sequences of some or all of the MSDK tags, there are a number of analyses that could be pursued.

Enumeration of MSDK Tags

The numbers of each tag, or a subgroup of tags, in a MSDK library can be computed. Then, for example, optionally having normalized the number of each to the total number of cloned tag sequences obtained, the resulting MSDK profile (consisting of a list of MSDK tags and the abundance (number) of each MSDK tag) can be compared to corresponding MSDK profiles obtained with other cells of interest. In computing the total numbers of individual MSDK tags, where ditags have been amplified by PCR (step (b) above), ditag replicates are deleted from the analysis. Since the chance of any one ditag combination occurring more than once as a result of step (a) above would be extremely low, replicate ditags would likely be due to the PCR amplification procedure. Ways to estimate the numbers of individual tag sequences include the same methods described above for identifying the tag sequences.
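The enumeration described above (deleting replicate ditags, counting each tag, and optionally normalizing to the total number of tags) can be sketched as follows. This is an illustrative sketch under stated assumptions: ditags are represented as pairs of tag sequences, a ditag read in either orientation is treated as the same ditag, and the function name is hypothetical.

```python
from collections import Counter


def msdk_profile(ditags, normalize=True):
    """Build an MSDK profile (tag -> abundance) from a list of (tag, tag) ditags.

    Replicate ditags, presumed to be PCR duplicates, are counted once.
    """
    # Sorting each pair makes (a, b) and (b, a) the same ditag.
    unique_ditags = {tuple(sorted(pair)) for pair in ditags}
    counts = Counter(tag for pair in unique_ditags for tag in pair)
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}
```

Normalized abundances sum to one, which allows libraries of different sequencing depth to be compared.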

The relative abundance (number) of a given MSDK tag obtained gives an indication of the relative frequency at which the nearest MMRE recognition sequence to the FRE recognition sequence associated with the given tag is unmethylated. The higher the number of the MSDK tag obtained, the more frequently that MMRE recognition sequence is unmethylated. Because, by the nature of the method, any given MMRE recognition sequence is correlated with a MSDK tag associated with the nearest FRE recognition sequence upstream of it and with the nearest FRE recognition sequence downstream of it, if any two MMRE recognition sites occur without an appropriate FRE recognition site between them, it will always be possible to discriminate the methylation status (methylated or not methylated) of both the MMRE recognition sites. On the other hand, if three MMRE recognition sites occur without an FRE recognition sequence between the first and third, it might not be possible to discriminate the methylation status of the middle MMRE recognition sequence. However, the chances of this occurring can be reduced to essentially zero by choosing a FRE that has a recognition sequence occurring in the genomic DNA of interest much more frequently than that of the selected MMRE. Indeed, since the sequence of the genome of interest is generally known, this potential resolution-impairing eventuality can be tested for in advance by examining the genomic nucleotide sequences and, if necessary, overcome by selecting an alternative MMRE-FRE combination or by performing a plurality of analyses using a number of different MMRE-FRE combinations.
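The advance test mentioned above can be sketched as a scan of a known genomic sequence for runs of three consecutive MMRE sites with no FRE site between the first and third; the middle site of each such run is the one whose methylation status might not be resolvable. The function names are hypothetical, and the sketch scans a single strand, which suffices for palindromic recognition sequences.

```python
from bisect import bisect_left, bisect_right


def find_sites(genome: str, site: str):
    """Return sorted start positions (0-based) of a recognition sequence."""
    genome, site = genome.upper(), site.upper()
    return [i for i in range(len(genome) - len(site) + 1)
            if genome.startswith(site, i)]


def unresolved_middle_sites(genome: str, mmre_site: str, fre_site: str):
    """Positions of MMRE sites flanked by two MMRE sites with no FRE site between them."""
    mmre = find_sites(genome, mmre_site)
    fre = find_sites(genome, fre_site)
    flagged = []
    for first, middle, third in zip(mmre, mmre[1:], mmre[2:]):
        # Equal indices mean no FRE start lies strictly between first and third.
        if bisect_right(fre, first) == bisect_left(fre, third):
            flagged.append(middle)
    return flagged
```

An empty result for a candidate MMRE-FRE pair suggests that the combination does not suffer from this limitation in the sequence examined.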

MSDK tag profiles composed of all the tag sequences obtained in an MSDK analysis, and preferably (but not necessarily) the relative numbers of all the MSDK tags, can be compared to corresponding profiles obtained with other cell types. Corresponding profiles will of course be those generated using the same MMRE, FRE, and RRE and in at least an overlapping part, if not an identical portion, of the relevant genome. Such comparisons can be used, for example, to identify a test cell of interest. For example, a test cell could be a cell of type x, type y, or type z. The MSDK profile obtained with the test cell can be compared to control corresponding MSDK profiles obtained from control cells of type x, type y, and type z. The test cell will likely be of the same type as, or at least most closely related to, the control cell (type x, y, or z) whose MSDK profile the test cell's profile most closely resembles. Alternatively, the MSDK profile of a test cell can be compared to that of a single control cell and, if the test cell's profile is significantly different from the control cell's profile, the test cell is likely to be of a different type than the control cell. Statistical methods for doing the above-described analyses are known to those skilled in the art.
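The comparison of a test profile with control profiles of types x, y, and z can be illustrated with a simple nearest-profile rule. The distance measure used here (the sum of absolute differences in normalized tag abundance) is an assumption chosen for brevity and is not prescribed by the text, which leaves the choice of statistical method to the skilled person; the function names are likewise hypothetical.

```python
def profile_distance(a: dict, b: dict) -> float:
    """Sum of absolute abundance differences over all tags seen in either profile."""
    tags = set(a) | set(b)
    return sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in tags)


def closest_control(test_profile: dict, control_profiles: dict) -> str:
    """Return the name of the control profile that the test profile most resembles."""
    return min(control_profiles,
               key=lambda name: profile_distance(test_profile, control_profiles[name]))
```

In practice a significance test would accompany such a ranking, since the closest profile is not necessarily a close one.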

The number of MSDK tag species in any given MSDK tag profile varies greatly depending on how many are available and their relative discriminatory power. Indeed, where a particular MSDK tag can discriminate specifically between two cell types of interest, the MSDK tag profile can contain it alone. Thus MSDK tag profiles can contain as few as one MSDK tag. However, they will generally contain a plurality of different MSDK tags, e.g., at least: 2; 3; 4; 5; 6; 7; 8; 9; 10; 12; 15; 20; 25; 30; 35; 40; 50; 60; 75; 85; 100; 120; 140; 160; 180; 200; 250; 300; 350; 400; 450; 500; 600; 700; 800; 900; 1,000; 2,000; 5,000; 10,000; or even more tag species.

The range of “cell types” that can be compared in the above analyses is of course enormous. Thus, for example, the MSDK profile of a test bacterium can be compared to control MSDK profiles of bacteria of: various species of the same genus as the test bacterium (if its genus is known but its species is to be defined); various strains of the same species as the test bacterium (if its species is known but its strain is to be defined); or even various isolates of the same strain as the test bacterium but from, for example, various ecological niches (if the strain of the test bacterium, but not its ecological origin, is known). The same principle can be applied to any biological cell and to any level of speciation of a biological cell. Similarly, the MSDK profiles of eukaryotic (e.g., mammalian) test cells can be compared to corresponding MSDK profiles of control cells of various tissues, of various stages of development, and of various lineages. In addition, the MSDK profile of a test vertebrate cell can be compared to one or more control MSDK profiles of cells (of, for example, the same tissue as the test cell) that are normal or malignant in order to determine (diagnose) whether the test cell is a malignant cell. Moreover, the MSDK profile of a cancer test cell can be compared to one or more control MSDK profiles of cancers of a variety of tissues in order to define the tissue origin of the test cell. In addition, the MSDK profile of a test cell can be compared to that or those of (a) control cell(s) that can be identical to, or similar to, or even different from, the test cell but has/have been exposed or subjected to any of a large number of experimental or natural influences, e.g., drugs, cytokines, growth factors, hormones, or any other pharmaceutical or biological agents, physical influences (e.g., elevated and/or depressed temperature or pressure), or environmental conditions (e.g., drought or monsoon conditions). It will thus be appreciated that the term “cell type” covers a large variety of cells and that the cell type (or types) used or defined in any particular analysis will depend on the nature of the analysis being performed. Those skilled in the art will be able to select appropriate control cell types for the analyses of interest.

Examples of MSDK profiles useful as control profiles are provided herein. Thus, for example, the MSDK profile of a test breast cell (e.g., an epithelial cell, a myoepithelial cell, or a fibroblast) from a human subject could be compared to the MSDK profiles of breast epithelial cells, myoepithelial cells, and fibroblast-enriched stromal cells from both control normal and control breast cancer (e.g., DCIS or invasive breast cancer) subjects in order to establish whether the test breast tissue from which the test breast cell was obtained is cancerous breast tissue. Moreover, the MSDK profile of a test cancer cell can be compared to those of control breast, prostate, colon, lung, and pancreatic cancer cells as part of an analysis to establish the tissue of origin of the test cancer cell. In addition, the MSDK profile of a cell suspected of being either an epithelial or myoepithelial cell can be compared to those of control normal (and/or cancerous, depending on whether the test cell is normal, cancerous, or not yet established to be normal or cancerous) epithelial and myoepithelial cells in order to establish whether the test cell is an epithelial or myoepithelial cell.

Mapping of MMRE Recognition Sequences

Alternatively, or in addition to enumerating MSDK tags, once the tags obtained by the MSDK analysis have been identified, the locations in the genome of interest corresponding to the tags (referred to herein as “genomic tag sequences”) can be established by comparison of the tag sequences to the nucleotide sequence of the genome (or part of the genome) of interest. This can be done manually but is preferably done by computer. The relevant genomic sequence information can be loaded into the computer from a medium (e.g., a computer diskette, a CD ROM, or a DVD) or it can be downloaded from a publicly available internet database.

One method by which the genomic tag sequences can be identified is by first creating a “virtual” tag library using the following information: (a) the nucleotide sequence of the genome (or part of the genome) of interest; (b) the nucleotide sequence of the MMRE recognition sequence; (c) the nucleotide sequence of the FRE recognition sequence; and (d) the number of nucleotides separating the RRE recognition sequence from the RRE cutting site. Optimally, virtual tag sequences that are not unique (i.e., that could arise in a MSDK library from more than one genetic locus) are deleted from the virtual MSDK library. By comparing the sequences of the tags obtained in the test MSDK analysis to the virtual tag library, it is possible to determine the genomic location of MSDK tags of interest, e.g., all the tags obtained by the analysis or one or more of such tags.
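A virtual tag library of the kind just described can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the text: a single linear sequence is scanned on one strand; the tag length is supplied directly (e.g., 17 for MmeI with NlaIII); each tag is read from the FRE site toward the MMRE site, so tags from an FRE site downstream of the MMRE site are reverse-complemented; and FRE sites lying closer to the MMRE site than one tag length are skipped. All names are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(genome: str, site: str):
    """Sorted 0-based start positions of a recognition sequence."""
    return [i for i in range(len(genome) - len(site) + 1)
            if genome.startswith(site, i)]


def virtual_tag_library(genome: str, mmre_site: str, fre_site: str, tag_len: int):
    """Map each unique virtual tag to (FRE site position, strand read)."""
    genome, mmre_site, fre_site = genome.upper(), mmre_site.upper(), fre_site.upper()
    fre_positions = find_sites(genome, fre_site)
    n_fre, n_mmre = len(fre_site), len(mmre_site)
    loci = defaultdict(set)
    for m in find_sites(genome, mmre_site):
        upstream = [f for f in fre_positions if f + n_fre <= m]
        downstream = [f for f in fre_positions if f >= m + n_mmre]
        if upstream:
            f = upstream[-1]                      # nearest FRE site 5' of the MMRE site
            tag_end = f + n_fre + tag_len
            if tag_end <= m:
                loci[genome[f + n_fre:tag_end]].add((f, "+"))
        if downstream:
            f = downstream[0]                     # nearest FRE site 3' of the MMRE site
            tag_start = f - tag_len
            if tag_start >= m + n_mmre:
                loci[revcomp(genome[tag_start:f])].add((f, "-"))
    # Delete tags that could arise from more than one locus.
    return {tag: next(iter(where)) for tag, where in loci.items() if len(where) == 1}
```

Observed tags can then be looked up in the returned dictionary to obtain their genomic positions; tags absent from it are either non-unique or not predicted by the sequence used.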

Once the genomic location of the genomic tag sequences has been obtained, it is a simple matter to identify genes in which, or close to which, the genomic tag sequences are located. This step can be done manually, but can also be done by a computer. Such genes can be the subject of additional analyses, e.g., those described below.

Methods of Determining Levels of DNA Methylation

The invention features methods of assessing the level of methylation of genomic regions (e.g., genes or subregions of genes) of interest. The methods can be applied to genomic regions identified by the MSDK analyses described above or selected on any other basis, e.g., the observation of differential expression of a gene in two cell types (e.g., a normal cell and a cancer cell of the same tissue as the normal cell) of interest.

The methods are of particular interest in the diagnosis of cancer. In broad terms, it has been claimed that the genomes of cancer cells are hypomethylated relative to corresponding normal cells [Feinberg et al. (1983) Nature 301:89-92]. Moreover, gene hypermethylation is frequently associated with decreased expression of the relevant gene. However, at the individual gene level these generalizations do not always apply. Thus, for example, some genes can be hypermethylated in cancer cells in comparison to corresponding normal cells, hypermethylation of some genes is associated with increased expression, and hypomethylation of some genes is associated with decreased expression of the relevant genes. Interestingly, in the examples below, it was observed that hypermethylation of the promoter region of one gene (Cxorf12) was associated with decreased expression of the gene, while hypermethylation of the exons and/or introns of three other genes (PRDM14, HOXD4, and CDC42EP5) was associated with increased expression of the genes.

As used herein, the term “gene” refers to a genomic region starting 10 kb (kilobases) 5′ of a transcription initiation site and terminating 2 kb 3′ of the polyA signal associated with the coding sequence within the genomic region. Where the polyA signal of another gene is located less than 10 kb 5′ of the transcription initiation site of a gene of interest, for the purposes of the instant invention, the gene of interest is considered to start at the first nucleotide immediately after the polyA signal of the other gene. Moreover, where a transcription initiation site of another gene is less than 2 kb 3′ of the polyA signal of the gene of interest, for the purposes of the instant invention, the gene of interest terminates at the nucleotide immediately before the transcription initiation site of the other gene. From these definitions it will be appreciated that, as used herein, promoter regions and regions 3′ of polyA signals of adjacent genes can overlap.

As used herein, the “promoter region” of a gene refers to a genomic region starting 10 kb 5′ of a transcription initiation site and terminating at the nucleotide immediately 5′ of the transcription initiation site. Where a polyA signal of another gene is located less than 10 kb 5′ of the transcription initiation site of a gene of interest, for the purposes of the instant invention, the promoter region of the gene of interest starts at the first nucleotide immediately following the polyA signal of the other gene.

As used herein, the terms “exons” and “introns” refer to amino acid coding and non-coding, respectively, nucleotide sequences occurring between the transcription initiation site and start of the polyA sequence of a gene.
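The definitions of “gene” and “promoter region” given above translate directly into coordinate arithmetic. The sketch below is an illustration under stated assumptions: coordinates are 1-based positions on the strand on which the gene is transcribed 5′ to 3′ (so smaller numbers are 5′), each polyA signal is represented by the position of its last nucleotide, and the function and parameter names are hypothetical.

```python
from typing import Optional, Tuple


def gene_bounds(tss: int, polya: int,
                prev_polya: Optional[int] = None,
                next_tss: Optional[int] = None,
                upstream: int = 10_000,
                downstream: int = 2_000) -> Tuple[int, int]:
    """Return (start, end) of a gene as defined in the text."""
    start = tss - upstream
    if prev_polya is not None and 0 < tss - prev_polya < upstream:
        # Another gene's polyA signal lies less than 10 kb 5' of the TSS.
        start = prev_polya + 1
    end = polya + downstream
    if next_tss is not None and 0 < next_tss - polya < downstream:
        # Another gene's TSS lies less than 2 kb 3' of the polyA signal.
        end = next_tss - 1
    return max(1, start), end


def promoter_bounds(tss: int, prev_polya: Optional[int] = None,
                    upstream: int = 10_000) -> Tuple[int, int]:
    """Return (start, end) of the promoter region, ending just 5' of the TSS."""
    start, _ = gene_bounds(tss, tss, prev_polya=prev_polya, upstream=upstream)
    return start, tss - 1
```

A genomic tag position can then be assigned to a gene or promoter region by testing whether it falls within the returned interval.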

As used herein, a “CpG island” is a sequence of genomic DNA in which the number of CpG dinucleotide sequences is significantly higher than their average frequency in the relevant genome. Generally, CpG islands are not greater than 2,000 (e.g., not greater than: 1,900; 1,800; 1,700; 1,600; 1,500; 1,400; 1,300; 1,200; 1,100; 1,000; 900; 800; 700; 600; 500; 400; 300; 200; 100; 75; 50; 25; or 15) bp long. They will generally contain not less than one CpG sequence to every 100 (e.g., every: 90; 80; 70; 60; 50; 40; 35; 30; 25; 20; 15; 10; or 5) bp of DNA sequence. CpG islands can be separated by at least 20 (e.g., at least: 20; 35; 50; 60; 80; 100; 150; 200; 250; 300; 350; or 500) bp of genomic DNA.

In the methods of the invention, the degree of methylation of one or more C residues (in CpG sequences) in a gene of a test cell is determined. This degree of methylation can then be compared to that in one or more (e.g., two, three, four, five, six, seven, eight, nine, ten, 11, 12, 15, 18, 20, 25, 30, 35, 40, 50, 75, 100, 200, or more) control cells.

If the level of methylation in the test cell is altered compared to, for example, that of a control cell, the test cell is likely to be different from the control cell. For example, the test cell can be a cell from any of the vertebrate tissues recited herein, the control cell can be a normal cell of that tissue, and the gene can be any one that is differentially methylated in cells from cancerous versus normal tissue (e.g., any of the genes listed in Tables 2, 5, 7, 8, 10, 12 and 15). If the degree of methylation of the gene in the test cell is different from that in the normal cell, the test cell is likely to be a cancer cell.

Alternatively, the level of methylation in the test cell can be compared to that in two or more (see above) control cells. The test cell will be the same as, or most closely related to, the control cell in which the degree of methylation is the same as, or most closely resembles, that of the test cell.

The whole of a gene or parts of a gene (e.g., the promoter region, the transcribed regions, the translated region, exons, introns, and/or CpG islands) can be analyzed.

Test and control cells can be the same as those listed above in the section on MSDK. Genes that can be analyzed can be any gene differentially methylated in two or more cell types of interest. In the methods of the invention any number of genes can be analyzed in order to characterize a test cell of interest. Thus, one, two, three, four, five, six, seven, eight, nine, ten, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 25, 28, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 500, or even more genes can be analyzed. The genes can be, for example, any of the DNA sequences (e.g., the genes) listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16. The entire genes or one or more subregions of the genes (e.g., all or parts of promoter regions, all or parts of transcribed regions, exons, introns, and regions 3′ of polyA signals) can be analyzed.

Specific genes of interest include, for example, the LMX-1A, COL5A, LHX3, TCF7L1, PRDM14, ZCCHC14, HOXD4, SLC9A3R1, CDC42EP5, Cxorf12, LOC389333, SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10 genes.

Methylation levels of one or more of these DNA sequences (e.g., genes) can be used to determine, for example, whether a test epithelial cell from breast tissue is a normal or cancerous epithelial cell (e.g., a DCIS (high, intermediate, or low grade) or invasive breast cancer cell). Particularly useful for such determinations are the PRDM14 and ZCCHC14 genes. For example, with respect to the PRDM14 gene, a gene segment that is or contains all or part of SEQ ID NO:1 (FIG. 6A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 8-17; 341-392; 371-426; or 391-405 of SEQ ID NO:1. Methylation of the PRDM14 gene can similarly be used to determine whether a test cell from, for example, pancreas, lung, or prostate is a cancer cell or normal cell. In addition, with respect to the ZCCHC14 gene, a gene segment that is or contains all or part of SEQ ID NO:2 (FIG. 17) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 154-236; 154-279; 154-293; or 154-299 of SEQ ID NO:2. Hypermethylation of these genes, and particularly hypermethylation of their coding regions, would indicate that the relevant test cells are cancer cells.

In addition, methylation levels of one or more of the above-listed genes can be used to determine, for example, whether a test epithelial cell from colon tissue is a normal or cancerous epithelial cell. Particularly useful for such determinations are the LHX3, TCF7L1, and LMX-1A genes. For example, with respect to the LHX3 gene, a gene segment that is or contains all or part of SEQ ID NO:3 (FIG. 6A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 667-778; 739-788; 918-931; or 885-903 of SEQ ID NO:3. In addition, for example, with respect to the TCF7L1 gene, a gene segment that is or contains all or part of SEQ ID NO:4 (FIG. 8A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 708-737; 761-780; 807-864; or 914-929 of SEQ ID NO:4. Moreover, for example, with respect to the LMX-1A gene, a gene segment that is or contains all or part of SEQ ID NO:5 (FIG. 7A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 849-878; 898-940; 948-999; or 1,020-1,039 of SEQ ID NO:5. Hypermethylation of these genes would indicate that the test cell is a cancerous colon epithelial cell.

Furthermore, methylation levels of the above-listed genes can be analyzed to determine, for example, whether breast tissue from which a test myoepithelial cell is obtained is normal or cancerous breast tissue. Particularly useful for such determinations are the HOXD4, SLC9A3R1, and CDC42EP5 genes. For example, with respect to the HOXD4 gene, a gene segment that is or contains all or part of SEQ ID NO:6 (FIG. 18A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 185-255; 288-313; 312-362; or 328-362 of SEQ ID NO:6. In addition, for example, with respect to the SLC9A3R1 gene, a gene segment that is or contains all or part of SEQ ID NO:7 (FIG. 19A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 104-126; 104-247; 104-283; or 246-283 of SEQ ID NO:7. Moreover, for example, with respect to the CDC42EP5 gene, a gene segment that is or contains all or part of SEQ ID NO:8 (FIG. 21A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 181-247; 282-328; 336-359; or 336-390 of SEQ ID NO:8. Hypermethylation of these genes, and particularly their coding regions, would indicate that the test myoepithelial cell is from cancerous breast tissue.

Methylation levels of the above-listed genes can also be analyzed to determine, for example, whether breast tissue from which a test fibroblast is obtained is normal or cancerous breast tissue. Particularly useful for such determinations is the Cxorf12 gene. For example, with respect to this gene, a gene segment that is or contains all or part of SEQ ID NO:9 (FIG. 22A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 120-134; 159-201; 206-247; or 293-313 of SEQ ID NO:9. Hypermethylation of this gene, and particularly its promoter region, would indicate that the test fibroblast is from cancerous breast tissue.

In addition, methylation levels of the above-listed genes can also be analyzed to determine, for example, whether a test cell is an epithelial cell or a myoepithelial cell. Such assays can be applied to both normal and cancerous cells. Particularly useful for such determinations are the LOC389333 and CDC42EP5 genes. For example, with respect to the LOC389333 gene, a gene segment that is or contains all or part of SEQ ID NO:10 (FIG. 20A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 306-330; 334-361; 373-407; or 415-484 of SEQ ID NO:10. With respect to the CDC42EP5 gene, examples of gene segments that can be analyzed include those described above for discriminating whether tissue from which a test myoepithelial cell was obtained was normal or cancerous. Significantly high levels of methylation of these genes would indicate that the test cell was an epithelial rather than a myoepithelial cell.

In addition, methylation levels of the above-listed genes can also be analyzed to determine, for example, whether a test cell is a stem cell, or a differentiated cell derived therefrom, such as an epithelial cell or a myoepithelial cell. Such assays can be applied to both normal and cancerous cells. Particularly useful for such determinations are the SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10 genes. For example, with respect to the FOXC1 gene, a gene segment that is or contains all or part of SEQ ID NO:12 (FIG. 27A) can be analyzed in order to discriminate these cell types. In some cases, significantly high levels of methylation of some of these genes would indicate that the test cell was a stem cell rather than a differentiated cell derived therefrom (e.g., an epithelial or a myoepithelial cell).

Levels of methylation of C residues of interest can be assessed and expressed in quantitative, semi-quantitative, or qualitative fashions. Thus they can, for example, be measured and expressed as discrete values. Alternatively, they can be assessed and expressed using any of a variety of semi-quantitative/qualitative systems known in the art. Thus, they can be expressed as, for example, (a) one or more of “very high”, “high”, “average”, “moderate”, “low”, and/or “very low”; (b) one or more of “++++”, “+++”, “++”, “+”, “+/−”, and/or “−”; (c) methylated or not methylated (i.e., in a digital fashion); (d) ranges such as “0%-10%”, “11%-20%”, “21%-30%”, “31%-40%”, etc. (or any convenient range intervals); or (e) graphically, e.g., in pie charts.
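By way of illustration only, three of the reporting schemes listed above (ranges, plus/minus symbols, and a digital call) can be sketched in Python. The function names, the ten-point range width, the symbol cut-offs, and the 50% digital threshold are assumptions chosen for this example; the specification does not prescribe them. ASCII "+/-" and "-" stand in for the typographic symbols.

```python
import math

def range_label(percent, width=10):
    """Return a range label such as '11%-20%' for a methylation percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent <= width:
        return f"0%-{width}%"
    upper_raw = math.ceil(percent / width) * width
    lower = upper_raw - width + 1
    upper = min(upper_raw, 100)
    return f"{lower}%-{upper}%"

# Hypothetical cut-offs (lowest percentage that earns each symbol).
SYMBOL_CUTOFFS = [(80, "++++"), (60, "+++"), (40, "++"), (20, "+"), (5, "+/-")]

def plus_symbol(percent):
    """Map a methylation percentage onto a plus/minus scale."""
    for cutoff, symbol in SYMBOL_CUTOFFS:
        if percent >= cutoff:
            return symbol
    return "-"

def digital_call(percent, threshold=50):
    """Report methylation in a digital (yes/no) fashion."""
    return "methylated" if percent >= threshold else "not methylated"
```

Any convenient interval can be used by changing `width`, in keeping with the "any convenient range intervals" language above.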

Methods of measuring the degree of methylation of C residues in the CpG sequences are known in the art. Such methodologies include sequencing of sodium bisulfite-treated DNA and methylation-specific PCR and are described in the Examples below.

Standardizing methylation assays to discriminate between cell types of interest involves experimentation entirely familiar and routine to those in the art. For example, the methylation status of gene Q in a sample of cancer cells of interest obtained from one or more patients and in corresponding normal cells from normal individuals or from the same patients can be assessed. From such experimentation it will be possible to establish a range of “cancer levels” of methylation and a range of “normal levels” of methylation of gene Q. Alternatively, the methylation status of gene Q in cancer cells of each patient can be compared to the methylation status of gene Q in normal cells (corresponding to the cancer cells) obtained from the same patient. In such assays, it is possible that methylation of as few as one cytosine residue could discriminate between cancer and non-cancer cells.
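The establishment of “cancer level” and “normal level” ranges for gene Q can be sketched, purely as an illustration, as follows. The methylation percentages are invented for the example, and taking the observed minimum and maximum as the range is one simple assumption among many possible ways of defining such ranges.

```python
def level_range(levels):
    """Return the (lowest, highest) methylation level seen in a group of samples."""
    return min(levels), max(levels)

def classify(level, cancer_range, normal_range):
    """Assign a test level to whichever established range contains it."""
    in_cancer = cancer_range[0] <= level <= cancer_range[1]
    in_normal = normal_range[0] <= level <= normal_range[1]
    if in_cancer and not in_normal:
        return "cancer"
    if in_normal and not in_cancer:
        return "normal"
    # Outside both ranges, or inside an overlap of the two.
    return "indeterminate"

# Hypothetical percent methylation of gene Q in reference samples.
cancer_levels = [62.0, 71.5, 80.0, 90.2]
normal_levels = [3.0, 8.5, 12.0, 15.5]

cancer_range = level_range(cancer_levels)
normal_range = level_range(normal_levels)
```

A test cell with 75% methylation of gene Q would fall in the cancer range under these invented values, whereas a value of 40% would fall in neither range and would call for further analysis.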

Other methods for quantitating methylation of DNA are known in the art. Such methods are based on: (a) the inability of methylation-sensitive restriction enzymes to cleave sequences that contain one or more methylated CpG sites [Issa et al. (1994) Nat. Genet. 7:536-540; Singer-Sam et al. (1990) Mol. Cell. Biol. 10:4987-4989; Razin et al. (1991) Microbiol. Rev. 55:451-458; Stoger et al. (1993) Cell 73:61-71]; and (b) the ability of bisulfite to convert cytosine to uracil and the lack of this ability of bisulfite on methylated cytosine [Frommer et al. (1992) Proc. Natl. Acad. Sci. USA 89:1827-1831; Myöhanen et al. (1994) DNA Sequence 5:1-8; Herman et al. (1996) Proc. Natl. Acad. Sci. USA 93:9821-9826; Gonzalgo et al. (1997) Nucleic Acids Res. 25:2529-2531; Sadri et al. (1996) Nucleic Acids Res. 24:5058-5059; Xiong et al. (1997) Nucleic Acids Res. 25:2532-2534].
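The principle underlying approach (b) can be illustrated with a small simulation. This is a sketch only: the reference sequence, the methylated positions (0-based), and the function names are hypothetical, and the uracil produced by bisulfite is written as T on the assumption that the treated DNA is amplified and sequenced, so that uracil is read as thymine.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus amplification of one strand.

    Unmethylated C is read as T; methylated C (0-based positions given
    in methylated_positions) is protected and stays C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)

def cpg_positions(seq):
    """Return 0-based positions of the C in every CpG dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]

def methylation_fraction(reference, reads):
    """For each CpG in the reference, the fraction of reads that retain C."""
    result = {}
    for pos in cpg_positions(reference):
        calls = [read[pos] for read in reads if read[pos] in "CT"]
        if calls:
            result[pos] = calls.count("C") / len(calls)
    return result

# Hypothetical reference with CpG sites at positions 1 and 5.
reference = "ACGTTCGAC"
reads = [
    bisulfite_convert(reference, {1}),
    bisulfite_convert(reference, {1, 5}),
    bisulfite_convert(reference, set()),
    bisulfite_convert(reference, {1}),
]
fractions = methylation_fraction(reference, reads)
```

In this invented data set three of four reads retain C at position 1 and one of four at position 5, so the two sites are scored as 75% and 25% methylated, respectively.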

Gene Expression Assays

Experiments described in the Examples herein show that in a first cell in which methylation of a gene is altered (increased or decreased) relative to a second cell, expression of the gene in the first cell is also altered relative to the second cell. In addition, previous findings and the data in the Examples indicate that alterations in methylation status, and hence also consequent alterations in expression, of certain genes correlate with phenotypic changes in cells. These findings provide the basis for assays (e.g., diagnostic assays) to discriminate between two or more cell types.

In the methods of the invention, the level of expression of a gene of a test cell is determined. This level of expression can then be compared to that in one or more (e.g., two, three, four, five, six, seven, eight, nine, ten, 11, 12, 15, 18, 20, 25, 30, 35, 40, 50, 75, 100, 200, or more) control cells.

If the level of expression in the test cell is altered compared to, for example, that of a control cell, the test cell is likely to be different from the control cell. For example, the test cell can be a cell from any of the vertebrate tissues recited herein, the control cell can be a normal cell of that tissue, and the gene can be one shown to be differentially methylated in cells from cancerous and normal tissue (e.g., any of the genes listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16). If the level of expression of the gene in the test cell is different from that in the normal cell, the test cell is likely to be a cancer cell.

Alternatively, the level of expression in the test cell can be compared to that in two or more (see above) control cells. The test cell will be the same as, or most closely related to, the control cell in which the level of expression is the same as, or most closely resembles, that of the test cell.

Test and control cells can be any of those listed above in the section on MSDK. Genes whose level of expression can be determined can be any gene differentially methylated in two or more cell types of interest. They can be, for example, any of the genes listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.

Specific genes of interest include the LMX-1A, COL5A, LHX3, TCF7L1, PRDM14, ZCCHC14, HOXD4, SOX13, SLC9A3R1, CDC42EP5, Cxorf12, and LOC389333 genes.

Expression levels of one or more of these genes can be analyzed to determine, for example, whether a test epithelial cell from breast tissue is a normal or cancerous epithelial cell (e.g., a DCIS (high, intermediate, or low grade) or invasive breast cancer cell). Particularly useful for such determinations are the PRDM14 and ZCCHC14 genes. Moreover, expression of the PRDM14 gene can be used to test whether a test cell from prostate, pancreas, or lung tissue is a cancer cell. Thus, for example, enhanced expression of the PRDM14 gene, or altered expression of the ZCCHC14 gene, in the test breast epithelial cell compared to a control normal breast epithelial cell would be an indication that the test epithelial cell is a cancer cell.

In addition, expression levels of one or more of the above-listed genes can be analyzed to determine, for example, whether a test epithelial cell from colon tissue is a normal or cancerous epithelial cell. Particularly useful for such determinations are the LHX3, TCF7L1, and LMX-1A genes. Altered expression of these genes in the test colon epithelial cell compared to a control normal colon epithelial cell would be an indication that the test colon epithelial cell is a cancer cell.

Expression levels of one or more of the above-listed genes in a test myoepithelial cell can be analyzed to determine, for example, whether breast tissue from which the test myoepithelial cell was obtained is normal or cancerous breast tissue. Particularly useful for such determinations are the HOXD4, SLC9A3R1, and CDC42EP5 genes. Enhanced expression of, for example, the HOXD4 and CDC42EP5 genes, or altered expression of the SLC9A3R1 gene, in the test myoepithelial cell compared to a control myoepithelial cell from control normal breast tissue, would indicate that the test breast tissue is cancerous breast tissue.

Expression levels of one or more of the above-listed genes in a test fibroblast can also be analyzed to determine, for example, whether breast tissue from which the test fibroblast was obtained is normal or cancerous breast tissue. Particularly useful for such determinations is the Cxorf12 gene. Expression, for example, of this gene at the same or a greater level than in a control fibroblast from control normal breast tissue would indicate that the breast tissue is not cancerous breast tissue.

In addition, expression levels of one or more of the above-listed genes can also be analyzed to determine, for example, whether a test cell is an epithelial cell or a myoepithelial cell. Such assays can be applied to both normal and cancerous cells. Particularly useful for such determinations are the LOC389333 and CDC42EP5 genes. Expression of these genes in the test cell at a level that is the same as or similar to that of a control myoepithelial cell would be an indication that the test cell is a myoepithelial cell. On the other hand, expression of the genes in the test cell at a level that is the same as or similar to that of a control epithelial cell would be an indication that the test cell is an epithelial cell.

Levels of expression of genes of interest can be assessed and expressed in quantitative, semi-quantitative, or qualitative fashions. Thus they can, for example, be measured and expressed as discrete values. Alternatively, they can be assessed and expressed using any of a variety of semi-quantitative/qualitative systems known in the art. Thus, they can be expressed as, for example, (a) one or more of “very high”, “high”, “average”, “moderate”, “low”, and/or “very low”; (b) one or more of “++++”, “+++”, “++”, “+”, “+/−”, and/or “−”; (c) expressed or not expressed (i.e., in a digital fashion); (d) ranges such as “0%-10%”, “11%-20%”, “21%-30%”, “31%-40%”, etc. (or any convenient range intervals); or (e) graphically, e.g., in pie charts.

In the description below, a “gene X” represents any of the genes listed in Tables 2, 5, 7, 8, 10, and 12; mRNA transcribed from gene X is referred to as “mRNA X”; protein encoded by gene X is referred to as “protein X”; and cDNA produced from mRNA X is referred to as “cDNA X”. It is understood that, unless otherwise stated, descriptions containing these terms are applicable to any of the genes listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16, mRNAs transcribed from such genes, proteins encoded by such genes, or cDNAs produced from the mRNAs.

In the assays of the invention either: (1) the presence of protein X or mRNA X in cells is tested for or their levels in cells are assessed; or (2) the level of protein X is assessed in a liquid sample such as a body fluid (e.g., urine, saliva, semen, blood, or serum or plasma derived from blood); a lavage such as a breast duct lavage, a lung lavage, a gastric lavage, a rectal or colonic lavage, or a vaginal lavage; an aspirate such as a nipple aspirate; or a fluid such as a supernatant from a cell culture. In order to test for the presence, or measure the level, of mRNA X in cells, the cells can be lysed and total RNA can be purified or semi-purified from lysates by any of a variety of methods known in the art. Methods of detecting or measuring levels of particular mRNA transcripts are also familiar to those in the art. Such assays include, without limitation, hybridization assays using detectably labeled mRNA X-specific DNA or RNA probes and quantitative or semi-quantitative RT-PCR methodologies employing appropriate mRNA X and cDNA X-specific oligonucleotide primers. Additional methods for quantitating mRNA in cell lysates include RNA protection assays and serial analysis of gene expression (SAGE). Alternatively, qualitative, quantitative, or semi-quantitative in situ hybridization assays can be carried out using, for example, tissue sections or unlysed cell suspensions, and detectably (e.g., fluorescently or enzyme) labeled DNA or RNA probes.

Methods of detecting or measuring the levels of a protein of interest in cells are known in the art. Many such methods employ antibodies (e.g., polyclonal antibodies or monoclonal antibodies (mAbs)) that bind specifically to the protein. In such assays, the antibody itself or a secondary antibody that binds to it can be detectably labeled. Alternatively, the antibody can be conjugated with biotin, and detectably labeled avidin (a protein that binds to biotin) can be used to detect the presence of the biotinylated antibody. Combinations of these approaches (including “multi-layer” assays) familiar to those in the art can be used to enhance the sensitivity of assays. Some of these assays (e.g., immunohistological methods or fluorescence flow cytometry) can be applied to histological sections or unlysed cell suspensions. The methods described below for detecting protein X in a liquid sample can also be used to detect protein X in cell lysates.

Methods of detecting protein X in a liquid sample (see above) basically involve contacting a sample of interest with an antibody that binds to protein X and testing for binding of the antibody to a component of the sample. In such assays the antibody need not be detectably labeled and can be used without a second antibody that binds to protein X. For example, by exploiting the phenomenon of surface plasmon resonance, an antibody specific for protein X bound to an appropriate solid substrate is exposed to the sample. Binding of protein X to the antibody on the solid substrate results in a change in the intensity of surface plasmon resonance that can be detected qualitatively or quantitatively by an appropriate instrument, e.g., a Biacore apparatus (Biacore International AB, Rapsgatan, Sweden).

Moreover, assays for detection of protein X in a liquid sample can involve the use, for example, of: (a) a single protein X-specific antibody that is detectably labeled; (b) an unlabeled protein X-specific antibody and a detectably labeled secondary antibody; or (c) a biotinylated protein X-specific antibody and detectably labeled avidin. In addition, as described above for detection of proteins in cells, combinations of these approaches (including “multi-layer” assays) familiar to those in the art can be used to enhance the sensitivity of assays. In these assays, the sample (or an aliquot of the sample) suspected of containing protein X can be immobilized on a solid substrate such as a nylon or nitrocellulose membrane by, for example, “spotting” an aliquot of the liquid sample or by blotting of an electrophoretic gel on which the sample or an aliquot of the sample has been subjected to electrophoretic separation. The presence or amount of protein X on the solid substrate is then assayed using any of the above-described forms of the protein X-specific antibody and, where required, appropriate detectably labeled secondary antibodies or avidin.

The invention also features “sandwich” assays. In these sandwich assays, instead of immobilizing samples on solid substrates by the methods described above, any protein X that may be present in a sample can be immobilized on the solid substrate by, prior to exposing the solid substrate to the sample, conjugating a second (“capture”) protein X-specific antibody (polyclonal or mAb) to the solid substrate by any of a variety of methods known in the art. In exposing the sample to the solid substrate with the second protein X-specific antibody bound to it, any protein X in the sample (or sample aliquot) will bind to the second protein X-specific antibody on the solid substrate. The presence or amount of protein X bound to the conjugated second protein X-specific antibody is then assayed using a “detection” protein X-specific antibody by methods essentially the same as those described above using a single protein X-specific antibody. It is understood that in these sandwich assays, the capture antibody should not bind to the same epitope (or range of epitopes in the case of a polyclonal antibody) as the detection antibody. Thus, if a mAb is used as a capture antibody, the detection antibody can be either: (a) another mAb that binds to an epitope that is either completely physically separated from or only partially overlaps with the epitope to which the capture mAb binds; or (b) a polyclonal antibody that binds to epitopes other than or in addition to that to which the capture mAb binds. On the other hand, if a polyclonal antibody is used as a capture antibody, the detection antibody can be either: (a) a mAb that binds to an epitope that is either completely physically separated from or partially overlaps with any of the epitopes to which the capture polyclonal antibody binds; or (b) a polyclonal antibody that binds to epitopes other than or in addition to that to which the capture polyclonal antibody binds. Assays which involve the use of a capture and detection antibody include sandwich ELISA assays, sandwich Western blotting assays, and sandwich immunomagnetic detection assays.

Suitable solid substrates to which the capture antibody can be bound include, without limitation, the plastic bottoms and sides of wells of microtiter plates, membranes such as nylon or nitrocellulose membranes, polymeric (e.g., without limitation, agarose, cellulose, or polyacrylamide) beads or particles. It is noted that protein X-specific antibodies bound to such beads or particles can also be used for immunoaffinity purification of protein X.

Methods of detecting or quantifying a detectable label depend on the nature of the label and are known in the art. Appropriate labels include, without limitation, radionuclides (e.g., ¹²⁵I, ¹³¹I, ³⁵S, ³H, ³²P, ³³P, or ¹⁴C), fluorescent moieties (e.g., fluorescein, rhodamine, or phycoerythrin), luminescent moieties (e.g., Qdot™ nanoparticles supplied by the Quantum Dot Corporation, Palo Alto, Calif.), compounds that absorb light of a defined wavelength, or enzymes (e.g., alkaline phosphatase or horseradish peroxidase). The products of reactions catalyzed by appropriate enzymes can be, without limitation, fluorescent, luminescent, or radioactive or they may absorb visible or ultraviolet light. Examples of detectors include, without limitation, x-ray film, radioactivity counters, scintillation counters, spectrophotometers, colorimeters, fluorometers, luminometers, and densitometers.

In assays, for example, to diagnose breast cancer, the level of protein X in, for example, serum (or a breast cell) from a patient suspected of having, or at risk of having, breast cancer is compared to the level of protein X in sera (or breast cells) from a control subject (e.g., a subject not having breast cancer) or the mean level of protein X in sera (or breast cells) from a control group of subjects (e.g., subjects not having breast cancer). A significantly higher level, or lower level (depending on whether the gene of interest is expressed at higher or lower level in breast cancer or associated stromal cells), of protein X in the serum (or breast cells) of the patient relative to the mean level in sera (or breast cells) of the control group would indicate that the patient has breast cancer.
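A comparison of a patient's protein X level against a control group can be sketched as follows. The specification does not define what counts as “significantly” higher or lower; the cut-off of two standard deviations from the control mean used here is an assumption made only for the purposes of the example, as are the function name and the serum values.

```python
from statistics import mean, stdev

def differs_from_controls(patient_level, control_levels, direction, n_sd=2.0):
    """Return True if the patient level lies beyond n_sd standard deviations
    of the control mean in the expected direction ('higher' or 'lower')."""
    control_mean = mean(control_levels)
    control_sd = stdev(control_levels)  # sample standard deviation
    if direction == "higher":
        return patient_level > control_mean + n_sd * control_sd
    if direction == "lower":
        return patient_level < control_mean - n_sd * control_sd
    raise ValueError("direction must be 'higher' or 'lower'")

# Hypothetical serum levels of protein X (arbitrary units) in control subjects.
control_sera = [10.0, 12.0, 11.0, 9.0, 13.0]
```

The `direction` argument reflects the point made above that the informative change depends on whether the gene of interest is expressed at a higher or lower level in breast cancer or associated stromal cells.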

Alternatively, if a sample of the subject's serum (or breast cells) that was obtained at a prior date at which the patient clearly did not have breast cancer is available, the level of protein X in the test serum (or breast cell) sample can be compared to the level in the prior obtained sample. A higher level, or lower level (depending on whether the gene of interest is expressed at higher or lower level in breast cancer or associated stromal cells) in the test serum (or breast cell) sample would be an indication that the patient has breast cancer.

Moreover, a test expression profile of a gene in a test cell (or tissue) can be compared to control expression profiles of control cells (or tissues) previously established to be of defined category (e.g., DCIS grade, breast cancer stage, or state of differentiation). The category of the test cell (or tissue) will be that of the control cell (or tissue) whose expression profile the test cell's (or tissue's) expression profile most closely resembles. These expression profile comparison assays can be used to compare any of the normal breast tissue with any stage and/or grade of breast cancer recited herein and/or to compare between breast cancer grades and stages. The genes analyzed can be any of those listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16 and the number of genes analyzed can be any number, i.e., one or more. Generally, at least two (e.g., at least: two; three; four; five; six; seven; eight; nine; ten; 11; 12; 13; 14; 15; 17; 18; 20; 23; 25; 30; 35; 40; 45; 50; 60; 70; 80; 90; 100; 120; 150; 200; 250; 300; 350; 400; 450; 500; or more) genes will be analyzed. It is understood that the genes analyzed will include at least one of those listed herein but can also include others not listed herein.
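The profile comparison described above can be sketched as a nearest-profile assignment. The specification does not state how resemblance between profiles is to be measured; Euclidean distance over the shared genes is used here as one simple assumption. The gene names are taken from the text, but the expression values and category labels are invented for illustration.

```python
import math

def profile_distance(profile_a, profile_b):
    """Euclidean distance between two expression profiles over their shared genes."""
    genes = sorted(set(profile_a) & set(profile_b))
    if not genes:
        raise ValueError("profiles share no genes")
    return math.sqrt(sum((profile_a[g] - profile_b[g]) ** 2 for g in genes))

def closest_category(test_profile, control_profiles):
    """Return the category whose control profile the test profile most closely resembles."""
    return min(
        control_profiles,
        key=lambda category: profile_distance(test_profile, control_profiles[category]),
    )

# Hypothetical control profiles (arbitrary expression units).
controls = {
    "normal": {"PRDM14": 1.0, "ZCCHC14": 5.0, "HOXD4": 2.0},
    "DCIS": {"PRDM14": 6.0, "ZCCHC14": 2.0, "HOXD4": 3.0},
    "invasive": {"PRDM14": 9.0, "ZCCHC14": 1.0, "HOXD4": 6.0},
}
test_cell = {"PRDM14": 8.5, "ZCCHC14": 1.5, "HOXD4": 5.0}
```

With these invented values the test profile lies closest to the "invasive" control profile. Other similarity measures, such as correlation, could be substituted in `profile_distance` without changing the rest of the sketch.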

One of skill in the art will appreciate from this description how similar “test level” versus “control level” comparisons can be made between other test and control samples described herein.

It is noted that the patients and control subjects referred to above need not be human patients. They can be, for example, non-human primates (e.g., monkeys), horses, sheep, cattle, goats, pigs, dogs, guinea pigs, hamsters, rats, rabbits or mice.

Arrays and Kits and Uses Thereof

The invention features an array that includes a substrate having a plurality of addresses. At least one address of the plurality includes a capture probe that binds specifically to any of the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16, a nucleic acid X (e.g., a DNA sequence (AscI site) defined by the location of the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16), or a protein X. The array can have a density of at least, or less than, 10, 20, 50, 100, 200, 500, 700, 1,000, 2,000, 5,000 or 10,000 or more addresses/cm², and ranges between. In a preferred embodiment, the plurality of addresses includes at least 10, 100, 500, 1,000, 5,000, 10,000, or 50,000 addresses. In a preferred embodiment, the plurality of addresses includes equal to or less than 10, 100, 500, 1,000, 5,000, 10,000, or 50,000 addresses. The substrate can be a two-dimensional substrate such as a glass slide, a wafer (e.g., silica or plastic), a mass spectroscopy plate, or a three-dimensional substrate such as a gel pad. Addresses in addition to addresses of the plurality can be disposed on the array.

An array can be generated by any of a variety of methods. Appropriate methods include, e.g., photolithographic methods (see, e.g., U.S. Pat. Nos. 5,143,854; 5,510,270; and 5,527,681), mechanical methods (e.g., directed-flow methods as described in U.S. Pat. No. 5,384,261), pin-based methods (e.g., as described in U.S. Pat. No. 5,288,514), and bead-based techniques (e.g., as described in PCT US/93/04145).

In one embodiment, at least one address of the plurality includes a nucleic acid capture probe that hybridizes specifically to any of the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16, e.g., the sense or anti-sense (complement) strand of the tag sequences. Each address of the subset can include a capture probe that hybridizes to a different region of the MSDK tag. Such an array can be useful, for example, for detecting the presence and, optionally, assessing the relative numbers of one or more of the MSDK tags (or the complements thereof) listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16 in a sample, e.g., a MSDK tag library.

In another embodiment, at least one address of the plurality includes a nucleic acid capture probe that hybridizes specifically to a nucleic acid X, e.g., the sense or anti-sense strand. Nucleic acids of interest include, without limitation, all or part of any of the genes identified by the tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16, all or part of mRNAs transcribed from such genes, or all or part of cDNA produced from such mRNA. Each address of the subset can include a capture probe that hybridizes to a different region of a nucleic acid. Each address of the subset is unique, overlapping, and complementary to a different variant of gene X (e.g., an allelic variant, or all possible hypothetical variants). The array can be used, for example, to sequence gene X, mRNA X, or cDNA X by hybridization (see, e.g., U.S. Pat. No. 5,695,940) or assess levels of expression of gene X.

In another embodiment, at least one address of the plurality includes a polypeptide capture probe that binds specifically to protein X or a fragment thereof. The polypeptide can be a naturally-occurring interaction partner of protein X, e.g., a ligand for protein X where protein X is a receptor or a receptor for protein X where protein X is a ligand. Preferably, the polypeptide is an antibody, e.g., an antibody specific for protein X, such as a polyclonal antibody, a monoclonal antibody, or a single-chain antibody.

Antibodies can be polyclonal or monoclonal antibodies; methods for producing both types of antibody are known in the art. The antibodies can be of any class (e.g., IgM, IgG, IgA, IgD, or IgE) and be generated in any of the species recited herein. They are preferably IgG antibodies. Recombinant antibodies, such as chimeric and humanized monoclonal antibodies comprising both human and non-human portions, can also be used in the methods of the invention. Such chimeric and humanized monoclonal antibodies can be produced by recombinant DNA techniques known in the art, for example, using methods described in Robinson et al., International Patent Publication PCT/US86/02269; Akira et al., European Patent Application 184,187; Taniguchi, European Patent Application 171,496; Morrison et al., European Patent Application 173,494; Neuberger et al., PCT Application WO 86/01533; Cabilly et al., U.S. Pat. No. 4,816,567; Cabilly et al., European Patent Application 125,023; Better et al. (1988) Science 240, 1041-43; Liu et al. (1987) J. Immunol. 139, 3521-26; Sun et al. (1987) PNAS 84, 214-18; Nishimura et al. (1987) Canc. Res. 47, 999-1005; Wood et al. (1985) Nature 314, 446-49; Shaw et al. (1988) J. Natl. Cancer Inst. 80, 1553-59; Morrison, (1985) Science 229, 1202-07; Oi et al. (1986) BioTechniques 4, 214; Winter, U.S. Pat. No. 5,225,539; Jones et al. (1986) Nature 321, 522-25; Verhoeyen et al. (1988) Science 239, 1534; and Beidler et al. (1988) J. Immunol. 141, 4053-60.

Also useful for the arrays of the invention are antibody fragments and derivatives that contain at least the functional portion of the antigen-binding domain of an antibody. Antibody fragments that contain the binding domain of the molecule can be generated by known techniques. Such fragments include, but are not limited to: F(ab′)₂ fragments that can be produced by pepsin digestion of antibody molecules; Fab fragments that can be generated by reducing the disulfide bridges of F(ab′)₂ fragments; and Fab fragments that can be generated by treating antibody molecules with papain and a reducing agent. See, e.g., National Institutes of Health, 1 Current Protocols In Immunology, Coligan et al., ed. 2.8, 2.10 (Wiley Interscience, 1991). Antibody fragments also include Fv fragments, i.e., antibody products in which there are few or no constant region amino acid residues. A single chain Fv fragment (scFv) is a single polypeptide chain that includes both the heavy and light chain variable regions of the antibody from which the scFv is derived. Such fragments can be produced, for example, as described in U.S. Pat. No. 4,642,334, which is incorporated herein by reference in its entirety. For a human subject, the antibody can be a “humanized” version of a monoclonal antibody originally generated in a different species.

In another aspect, the invention features a method of analyzing the expression of gene X. The method includes providing an array as described above; contacting the array with a sample; and detecting binding of a nucleic acid X or protein X to the array. In one embodiment, the array is a nucleic acid array. Optionally the method further includes amplifying nucleic acid from the sample prior to or during contact with the array.

In another embodiment, the array can be used to assay gene expression in a tissue to ascertain tissue specificity of genes in the array, particularly the expression of gene X. If a sufficient number of diverse samples is analyzed, clustering (e.g., hierarchical clustering, k-means clustering, Bayesian clustering and the like) can be used to identify other genes which are co-regulated with gene X. For example, the array can be used for the quantitation of the expression of multiple genes. Thus, not only tissue specificity, but also the level of expression of a battery of genes in the tissue is ascertained. Quantitative data can be used to group (e.g., cluster) genes on the basis of their tissue expression per se and level of expression in that tissue.

For example, array analysis of gene expression can be used to assess gene X expression in one or more cell types (see above).

In another embodiment, the array can be used to monitor expression of one or more genes in the array with respect to time. For example, samples obtained from different time points can be probed with the array. Such analysis can identify and/or characterize the development of a gene X-associated disease or disorder (e.g., breast cancer such as invasive breast cancer); and processes, such as a cellular transformation associated with a gene X-associated disease or disorder. The method can also evaluate the treatment and/or progression of a gene X-associated disease or disorder.

The array is also useful for ascertaining differential expression patterns of one or more genes in normal and abnormal (e.g., malignant) cells. This provides a battery of genes (e.g., including gene X) that could serve as a molecular target for diagnosis or therapeutic intervention.

In another aspect, the invention features a method of analyzing a plurality of probes. The method is useful, e.g., for analyzing gene expression. The method includes: providing a first two dimensional array having a plurality of addresses, each address of the plurality being positionally distinguishable from each other address of the plurality, and each address of the plurality having a unique capture probe, e.g., wherein the capture probes are from a cell or subject which expresses gene X or from a cell or subject in which a gene X-mediated response has been elicited, e.g., by contact of the cell with nucleic acid X or protein X, or administration to the cell or subject of a nucleic acid X or protein X; providing a second two dimensional array having a plurality of addresses, each address of the plurality being positionally distinguishable from each other address of the plurality, and each address of the plurality having a unique capture probe, e.g., wherein the capture probes are from a cell or subject which does not express gene X (or does not express as highly as in the case of the cell or subject described above for the first array) or from a cell or subject in which a gene X-mediated response has not been elicited (or has been elicited to a lesser extent than in the first sample); contacting the first and second arrays with one or more inquiry probes (which are preferably other than a nucleic acid X, protein X, or antibody specific for protein X), and thereby evaluating the plurality of capture probes. Binding, e.g., in the case of a nucleic acid, hybridization with a capture probe at an address of the plurality, is detected, e.g., by signal generated from a label attached to the nucleic acid, polypeptide, or antibody.

The invention also features a method of analyzing a plurality of probes or a sample. The method is useful, e.g., for analyzing gene expression. The method includes: providing a first two dimensional array having a plurality of addresses, each address of the plurality being positionally distinguishable from each other address of the plurality, and each address of the plurality having a unique capture probe, and contacting the array with a first sample from a cell or subject which expresses or mis-expresses gene X or from a cell or subject in which a gene X-mediated response has been elicited, e.g., by contact of the cell with nucleic acid X or protein X, or administration to the cell or subject of nucleic acid X or protein X; providing a second two dimensional array having a plurality of addresses, each address of the plurality being positionally distinguishable from each other address of the plurality, and each address of the plurality having a unique capture probe, and contacting the array with a second sample from a cell or subject which does not express gene X (or does not express as highly as in the case of the cell or subject described for the first array) or from a cell or subject in which a gene X-mediated response has not been elicited (or has been elicited to a lesser extent than in the first sample); and comparing the binding of the first sample with the binding of the second sample. Binding, e.g., in the case of a nucleic acid, hybridization with a capture probe at an address of the plurality, is detected, e.g., by a signal generated from a label attached to the nucleic acid, polypeptide, or antibody. The same array can be used for both samples or different arrays can be used. If different arrays are used the same plurality of addresses with capture probes should be present on both arrays.
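The comparison of binding between the first and second samples can be sketched as a per-address ratio of signal intensities. This is an illustration only: the address labels, signal values, pseudocount, and two-fold cut-off are assumptions, and the specification does not prescribe any particular measure of differential binding.

```python
import math

def log2_ratios(first_signals, second_signals, pseudocount=1.0):
    """Per-address log2 ratio of first-sample signal to second-sample signal.

    Both arrays must carry the same plurality of addresses, as required
    when different arrays are used for the two samples.
    """
    if set(first_signals) != set(second_signals):
        raise ValueError("both arrays must carry the same addresses")
    return {
        address: math.log2(
            (first_signals[address] + pseudocount)
            / (second_signals[address] + pseudocount)
        )
        for address in first_signals
    }

def differential_addresses(first_signals, second_signals, min_abs_log2=1.0):
    """Addresses whose signal differs at least two-fold (by default) between samples."""
    ratios = log2_ratios(first_signals, second_signals)
    return sorted(a for a, r in ratios.items() if abs(r) >= min_abs_log2)

# Hypothetical label signals (arbitrary units) at three addresses.
first_sample = {"A1": 799.0, "A2": 99.0, "A3": 49.0}
second_sample = {"A1": 99.0, "A2": 99.0, "A3": 199.0}
```

The pseudocount simply avoids division by zero at addresses with no detectable signal; with the invented values above, addresses A1 and A3 are flagged and A2 is not.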

All the above listed capture probes useful for arrays can also be provided in the form of a kit or article of manufacture, optionally also containing packaging materials. In such kits or articles of manufacture, the capture probes can be provided as preformed arrays, i.e., attached to appropriate substrates as described above. Alternatively, they can be provided in unattached form.

The capture probes can be supplied in unattached form in any number. Moreover, each capture probe in a kit or article of manufacture can be provided in a separate vessel (e.g., bottle, vial, or package), all the capture probes can be combined in the same vessel, or a plurality of pools of capture probes can be provided, with each pool being provided in a separate vessel. In the kit or article of manufacture there can optionally be instructions (e.g., on the packaging materials or in a package insert) on how to use the arrays or unattached capture probes, e.g., on how to perform any of the methods described herein.

The following examples are intended to illustrate, not limit, the invention.

EXAMPLES

Example 1 Materials and Methods

Tissue Specimens and Primary Cell Cultures

Human breast tumor and fresh, frozen, or formalin fixed, paraffin embedded tumor specimens were obtained from the Brigham and Women's Hospital (Boston, Mass.), Columbia University (New York, N.Y.), University of Cambridge (Cambridge, UK), Duke University (Durham, N.C.), University Hospital Zagreb (Zagreb, Croatia), the National Disease Research Interchange (Philadelphia, Pa.), and the Breast Tumor Bank of the University of Liège (Liège, Belgium). All human tissue was collected without patient identifiers using protocols approved by the Institutional Review Boards of the institutions. In the case of matched tissue samples (i.e., normal and tumor tissue samples obtained from the same individuals), the normal tissue corresponding to the tumor was obtained from the ipsilateral breast several centimeters away from the tumor. Fresh tissue samples were immediately processed for immunomagnetic purification and cell subsets were purified as previously described [Allinen et al. (2004) Cancer Cell 6:17-32 and co-pending U.S. Patent Application Serial No. PCT/US2004/08866, the disclosures of which are incorporated herein by reference in their entirety]. Following the purification procedure, in some cases the purity of each cell population was confirmed by RT-PCR and primary cultures of the different cell types were initiated. Primary stromal fibroblasts were cultured in DMEM medium supplemented with 10% iron fortified bovine calf serum (Hyclone, Logan, Utah) prior to lysis and DNA and RNA isolation. Human embryonic stem cells were cultured on feeder layers using established protocols (for example, see, REF). DNA and RNA were isolated from the other cell types without prior culturing.

RNA and Genomic DNA Isolation, and cDNA Synthesis

RNA (total and polyA) was isolated from small numbers of cells using a μMACS™ kit (Miltenyi Biotec, Auburn, Calif.), while total RNA was isolated from large tissue samples, primary cultures, and cell lines using a guanidinium/cesium method [Allinen et al. (2004), supra]. Column flow-through fractions (in the μMACS™ method) and unprecipitated soluble material (guanidinium/cesium method) were used for the purification of genomic DNA using SDS/proteinase K digestion followed by phenol-chloroform extraction and isopropanol precipitation. cDNA synthesis was performed using the OMNI-SCRIPT™ kit from Qiagen (Valencia, Calif.) following the manufacturer's instructions.

Generation and Analysis of MSDK (Methylation Specific Digital Karyotyping) Libraries

MSDK libraries were generated by a modification of the digital karyotyping protocol [Wang et al. (2002) Proc. Natl. Acad. Sci. USA 16156-16161]. For each sample, 1-5 μg genomic DNA was digested with the methylation-sensitive enzyme AscI and the resulting fragments were ligated at their 5′ and 3′ ends to biotinylated linkers (5′-biotin-TTTGCAGAGGTTCGTAATCGAGTTGGGTGG-3′, 5′-phos-CGCGCCACCCAACTCGATTACGAACCTCTGC-3′). The biotinylated fragments were then digested with NlaIII as a fragmenting restriction enzyme. Resulting DNA fragments having biotinylated linkers at their termini were immobilized onto streptavidin-conjugated magnetic beads (Dynal, Oslo, Norway).

The remaining steps were essentially the same as those described for LongSAGE with minor modifications [Allinen et al. (2004) supra; Saha et al. (2002) Nat. Biotechnol. 20:508-512]. Briefly, linkers containing the type IIs restriction enzyme MmeI recognition site were ligated to isolated DNA fragments and the bead bound fragments were cut by the MmeI enzyme 21 base pairs away from the restriction enzyme site, resulting in release from the beads into the surrounding solution of tags containing the MmeI recognition site, a linker, and 21 base pairs of test genomic DNA. The tags were ligated to form ditags, which are formed between single tags containing 5′ and 3′ MmeI digestion (cut) sites (depending on whether the relevant fragment bound to a bead was derived from an NlaIII site 5′ or 3′ of an unmethylated AscI site). The ditags were expanded by PCR, isolated, and ligated to form concatamers, which were cloned into the pZero 1.0 vector (Invitrogen, Carlsbad, Calif.) and sequenced. 21-bp tags were extracted and duplicate ditags (arising due to the PCR expansion step) were removed using SAGE 2002 software. P values were calculated based on pair-wise comparisons between libraries using a Poisson-based algorithm [Cai et al. (2004) Genome Biol. 5:R51; Allinen et al. (2004) supra]. Raw tag counts were used for comparing the libraries and calculating P values, but subsequently tag numbers were normalized in order to control for uneven total tag numbers/library (average total tag number 28,456/library).
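The tag-count handling described above can be illustrated with a minimal Python sketch. It does not reproduce the Poisson-based algorithm of Cai et al.; a simpler exact conditional binomial test is swapped in as a stand-in, and the function names and the default target library size are assumptions made only for illustration.

```python
from math import comb

def normalize(count, library_total, target_total=28456):
    """Scale a raw tag count to a common library size.

    The default target (the average library size quoted in the text)
    is an illustrative assumption.
    """
    return count * target_total / library_total

def pairwise_p(count_a, total_a, count_b, total_b):
    """Two-sided exact test that one tag occurs at the same rate in two libraries.

    If both raw counts are Poisson with rates proportional to library size,
    then, conditional on n = count_a + count_b, count_a follows
    Binomial(n, total_a / (total_a + total_b)).
    """
    n = count_a + count_b
    if n == 0:
        return 1.0
    p = total_a / (total_a + total_b)
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    observed = pmf[count_a]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))
```

As in the text, the test would be run on raw counts (for example, `pairwise_p(14, 24775, 0, 21278)`), with `normalize` applied only afterwards for display.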

In order to determine their chromosomal location, tags that appeared only once in each library were filtered out, and the remaining tags were matched to a virtual AscI library derived from a human genome sequence. Human genome sequence and mapping information (July 2003, hg16) were downloaded from the UCSC Genome Bioinformatics Site. A virtual AscI tag library was constructed based on the genome sequence as follows: predicted AscI sites were located in the genomic sequence, the nearest NlaIII sites in both directions to the AscI sites were identified, and the corresponding virtual MSDK sequence tags were derived. All virtual tags that were not unique in the genome were removed in order to ensure unambiguous mapping of the data. Genes neighboring the AscI sites were also identified in order to determine the effect of methylation on their expression.
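The construction of a virtual AscI tag library can be sketched as follows. This is a simplified illustration, not the inventors' software: it scans a single strand of one sequence, and it assumes that each tag is the 17 bases adjacent to the NlaIII site (CATG) on the side facing the AscI site, read away from the CATG. The names `virtual_tags` and `unique_tags` are hypothetical.

```python
import re
from collections import Counter

ASCI = "GGCGCGCC"   # AscI recognition sequence (given in the text)
NLAIII = "CATG"     # NlaIII recognition sequence
TAG_LEN = 17        # assumed number of genomic bases kept next to CATG

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def virtual_tags(genome):
    """Return (asci_position, side, tag) for the nearest NlaIII site on each side."""
    tags = []
    for match in re.finditer(ASCI, genome):
        start, end = match.start(), match.end()
        # Nearest CATG 5' of the AscI site: the tag reads forward, toward AscI.
        up = genome.rfind(NLAIII, 0, start)
        if up != -1:
            tag = genome[up + 4: up + 4 + TAG_LEN]
            if len(tag) == TAG_LEN:
                tags.append((start, "5'", tag))
        # Nearest CATG 3' of the AscI site: the tag reads backward, toward AscI.
        down = genome.find(NLAIII, end)
        if down != -1 and down - TAG_LEN >= 0:
            tags.append((start, "3'", revcomp(genome[down - TAG_LEN: down])))
    return tags

def unique_tags(tags):
    """Drop every tag sequence that occurs more than once in the library."""
    counts = Counter(tag for _, _, tag in tags)
    return [entry for entry in tags if counts[entry[2]] == 1]
```

A genome-scale version would also need to handle both strands, chromosome boundaries, and tags that run into the AscI site itself; those cases are deliberately left out of this sketch.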

Alignment of MSDK, SAGE, and CpG Islands Across the Genome

The frequency of AscI digestion was calculated as the percentage of samples (N-EPI-17, I-EPI-7, N-MYOEP-4, D-MYOEP-6, N-STR-17, I-STR-7, N-STR-117, I-STR-17) having raw tag counts of 2 or more at each predicted AscI site. SAGE counts from corresponding samples (N-EPI-1 plus N-EPI-2, I-EPI-7, N-MYOEP-1, D-MYOEP-6, D-MYOEP-7, N-STR-1, N-STRI-17, I-STR-7) were normalized to tags per 200,000. Gene and CpG island position information were downloaded from the UCSC Genome Bioinformatics Site (human genome sequence and mapping information, July 2003, hg16). AscI sites were predicted (as mentioned above) from the genome sequence, and AscI site frequency, SAGE counts, and CpG island positions were drawn together along all chromosomes.
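The two quantities defined above reduce to simple arithmetic, sketched here in Python. The function names are hypothetical and the input counts in the usage note are invented for illustration only.

```python
def digestion_frequency(raw_counts, min_count=2):
    """Percentage of samples whose raw tag count at one AscI site is >= min_count."""
    if not raw_counts:
        raise ValueError("at least one sample is required")
    hits = sum(1 for count in raw_counts if count >= min_count)
    return 100.0 * hits / len(raw_counts)

def sage_per_200k(tag_count, library_total):
    """Normalize a SAGE tag count to tags per 200,000."""
    return tag_count * 200_000 / library_total
```

For example, `digestion_frequency([0, 1, 2, 5, 0, 3, 2, 0])` counts four of eight samples at or above the cutoff, and `sage_per_200k(5, 50_000)` rescales a count of 5 from a 50,000-tag library.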

Bisulfite Sequencing, Quantitative Methylation Specific PCR (qMSP), and Quantitative RT-PCR (qRT-PCR)

To determine the location of methylated cytosines, genomic DNA was bisulfite treated and purified, and PCR reactions were performed as previously described [Herman et al. (1996) Proc. Natl. Acad. Sci. USA 93:9821-9826]. PCR products were “blunt-ended”, subcloned into pZERO1.0 (Invitrogen), and 4-13 independent colonies were sequenced for each PCR product.

Based on the above sequence analysis, qMSP PCR primers were designed for the amplification of methylated or unmethylated DNA. Quantitative MSP and RT-PCR amplifications were performed as follows. Template (2-5 ng bisulfite treated genomic DNA or 1 μl cDNA) and primers were mixed with 2× SYBR Green master mix (ABI, CA) in a 25 μl volume and the reactions were performed in an ABI 7500 real time PCR system (50° C., 20 sec; 95° C., 10 min; 95° C., 15 sec, 60° C., 1 min (40 cycles); 95° C., 15 sec; 60° C., 20 sec; 95° C., 15 sec). Triplicates were performed and average Ct values calculated. The Ct (cycle threshold) value is the PCR cycle number at which the reaction reaches a fluorescent intensity above the threshold, which is set in the exponential phase of the amplification (based on the amplification profile) to allow accurate quantification. In the case of qMSP, methylation of the samples was normalized to methylation independent amplification of the β-actin (ACTB) gene: % ACTB = 100 × 2^(Ct_ACTB − Ct_gene). For qRT-PCR, expression of the samples was normalized to that of the RPL39 (ribosomal protein L39) gene: % RPL39 = 10 × 2^(Ct_RPL39 − Ct_gene). Normalizations to the expression of the ribosomal protein L19 (RPL19) and ribosomal protein S13 (RPS13) genes were also performed and gave essentially the same results. Due to the very high abundance of ribosomal protein mRNAs, cDNA was diluted ten-fold for these PCR reactions relative to that of specific genes. The frequency of methylation of the PRDM14 gene in normal and tumor samples was calculated by setting a threshold of methylation as the median + 2 × standard deviation value of the relative methylation of the normal samples (excluding the one outlier case; see below). Samples above this value (10.66) were defined as methylated.
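The normalization formulas and the methylation threshold described above can be written out as a short Python sketch. The function names are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import median, stdev

def percent_of_reference(ct_reference, ct_gene, scale=100.0):
    """scale x 2^(Ct_reference - Ct_gene).

    scale=100 corresponds to the % ACTB formula (qMSP); scale=10 corresponds
    to the % RPL39 formula (qRT-PCR, reference cDNA diluted ten-fold).
    """
    return scale * 2 ** (ct_reference - ct_gene)

def methylation_threshold(normal_values):
    """Median + 2 x standard deviation of relative methylation in normal samples.

    Uses the sample standard deviation (an assumption).
    """
    return median(normal_values) + 2 * stdev(normal_values)

def is_methylated(value, threshold):
    """A sample is called methylated when it lies above the threshold."""
    return value > threshold
```

With averaged triplicate Ct values of 20 for the reference and 23 for the gene, `percent_of_reference(20.0, 23.0)` gives 100 × 2^(−3), i.e. 12.5% of the reference signal.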

Example 2 Methylation Specific Digital Karyotyping (MSDK)

The MSDK protocol used in the experiments described below is schematically depicted in FIG. 2.

MSDK is a modification of the digital karyotyping (DK) technique recently developed for the analysis of DNA copy number in a quantitative manner on a genome-wide scale [Wang et al. (2002) supra]. DK is based on two concepts: (i) short (e.g., 21 base pair) sequence tags can be derived from specific locations in the human genome; and (ii) these sequence tags can be directly matched to the human genome sequence. The original DK protocol used SacI as a mapping enzyme and NlaIII as a fragmenting enzyme. Using this enzyme combination, the tags were obtained from the two (both 5′ and 3′) NlaIII sites closest to the SacI sites.

In the MSDK method, instead of SacI, a mapping enzyme that is sensitive to DNA methylation was used. AscI was chosen because its recognition sequence (GGCGCGCC) has two CpG (potential methylation) sites, it is preferentially found in CpG islands associated with transcribed genes rather than repetitive elements [Dai et al. (2002) Genome Res. 12:1591-1598], and it is a rare cutter enzyme (~5,000 predicted sites/human genome), allowing identification of tags that are highly statistically significantly differentially present in the different libraries at reasonable sequencing depths (20,000-50,000 tags/library). Methylation of either or both methylation sites in an AscI recognition sequence prevents cutting by AscI. The use of AscI and NlaIII as mapping and fragmenting enzymes, respectively, with human genomic DNA is expected to result in a total of 7,205 virtual tags (defined as possible tags that can be obtained and uniquely matched to the human genome based on the predicted location of AscI and NlaIII sites). Since AscI will cut only unmethylated DNA, the presence of a tag in the MSDK library indicates that the corresponding AscI site is not methylated, while lack of a virtual tag indicates methylation.
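The presence/absence logic described above can be sketched in Python as follows. The data layout (a mapping from virtual tag to AscI site label) and the function name are hypothetical, and the sketch adds one caveat as an assumption: at finite sequencing depth an absent tag may also reflect a site that was simply not sampled.

```python
def call_asci_sites(virtual_library, observed_counts, min_count=1):
    """Call each AscI site from the presence or absence of its virtual tags.

    virtual_library: dict mapping tag sequence -> AscI site label
    observed_counts: dict mapping tag sequence -> raw count in the MSDK library
    """
    calls = {}
    for tag, site in virtual_library.items():
        seen = observed_counts.get(tag, 0) >= min_count
        # A site is called unmethylated if any of its tags was observed.
        if seen or site not in calls:
            calls[site] = "unmethylated" if seen else "methylated (or not sampled)"
    return calls
```

Because each AscI site can give rise to two tags (one from each neighboring NlaIII site), the sketch treats a single observed tag as sufficient evidence that the site was cut.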

To demonstrate the feasibility of the MSDK method for epigenome profiling, MSDK libraries were generated from genomic DNA isolated from the wild-type HCT116 human colon cancer cell line (HCT WT) and its derivative in which both the DNMT1 and DNMT3b DNA methyltransferase genes have been homozygously deleted (HCT DKO) [Rhee et al. (2002) Nature 416:552-556]. Due to the deletion of these two DNA methyltransferases, methylation of the genomic DNA in the HCT DKO cells is reduced by greater than 95% relative to the HCT WT cells. Thus, MSDK libraries generated from HCT WT and HCT DKO cells were expected to depict dramatic differences in DNA methylation. 21,278 and 24,775 genomic tags were obtained from the WT and DKO cells, respectively. These tags were matched to a virtual AscI tag library generated as described in Example 1. Unique tags (7,126 from the WT and 7,964 from the DKO cells) were compared and 219 were identified as being statistically significantly (p<0.05) differentially present in the two libraries (Table 1). 137 and 82 of these tags were more abundant in the DKO and WT libraries, respectively. Correlating with the overall hypomethylation of the genome of DKO cells, almost all of the 137 tags were at least 10 fold more abundant in the DKO library, while nearly all of the 82 tags showed only a 2-5 fold difference between the two libraries.

TABLE 1 Chromosomal location and analysis of the frequency of MSDK tags in the HCT116 WT and DKO MSDK libraries.

Chr         Virtual  Observed  WT       WT      DKO      DKO     Tag Variety   Tag Copy      Differential Tag (P < 0.05)
            Tag      Tag       Variety  Copies  Variety  Copies  Ratio DKO/WT  Ratio DKO/WT  DKO > WT  WT > DKO
1           551      119       73       431     89       538     1.219         1.248         10        6
2           473      94        51       383     72       499     1.412         1.303         10        5
3           349      83        48       478     59       473     1.229         0.990         8         5
4           281      62        33       266     49       265     1.485         0.996         3         5
5           334      74        41       437     56       536     1.366         1.227         10        3
6           338      65        36       229     51       315     1.417         1.376         8         4
7           403      90        60       359     66       344     1.100         0.958         4         4
8           334      89        54       460     73       433     1.352         0.941         3         5
9           349      86        50       397     67       468     1.340         1.179         9         5
10          387      84        43       386     71       468     1.651         1.212         10        4
11          379      96        55       408     75       392     1.364         0.961         6         4
12          299      72        42       330     52       329     1.238         0.997         7         4
13          138      25        12       109     19       105     1.583         0.963         1         1
14          228      51        28       234     36       225     1.286         0.962         4         3
15          260      52        38       243     37       163     0.974         0.671         2         4
16          340      82        43       297     65       347     1.512         1.168         4         2
17          400      116       54       401     100      781     1.852         1.948         16        3
18          181      39        19       115     29       199     1.526         1.730         7         0
19          463      99        59       429     70       391     1.186         0.911         9         7
20          236      58        32       213     41       287     1.281         1.347         4         2
21          71       11        7        27      6        43      0.857         1.593         1         0
22          217      51        31       328     38       260     1.226         0.793         1         4
X           185      22        16       166     18       103     1.125         0.620         0         2
Y           9        0         0        0       0        0
Matches     7205     1620      925      7126    1239     7964    1.339         1.118         137       82
No Matches           1353      799      5183    816      5805    1.021         1.120         29        13
Total       7205     2973      1724     12309   2055     13769   1.192         1.119         166       95

Chr, Chromosome. Virtual tags, the number of MSDK tag species predicted for the indicated chromosome. Observed Tags, the number of different unique tag species observed in both MSDK libraries for the indicated chromosome. Variety, the number of different unique tag species for the indicated chromosome and MSDK library. Copies, the abundance (total number) of all the observed unique tags for the indicated chromosome and MSDK library. Tag Variety Ratio, the ratio of the numbers of unique tag species for the indicated chromosome detected in the indicated two libraries. Tag Copy Ratio, the ratio of the abundances (total numbers) of all the unique tags for the indicated chromosomes detected in the indicated two libraries. Differential Tag (P < 0.05), the number of unique tag species observed for the indicated chromosome that were present in higher abundance in the one indicated MSDK library than in the other indicated MSDK library (P < 0.050).

Single nucleotide polymorphism (SNP) array analysis of the DNA samples used for the generation of MSDK libraries demonstrated that the two cell lines are indistinguishable using this technique and that the observed differences in MSDK tag numbers are unlikely to be due to underlying overt DNA copy number alterations. Mapping of the tags to the genome revealed that many of the differentially methylated AscI sites are located in CpG islands and in promoter areas of genes implicated in development and differentiation, including numerous homeogenes (Table 2). Consistent with these results, two of these genes, LMX-1A and COL5A, have previously been found to be differentially methylated between HCT116 WT and DKO cells, and are also frequently methylated in primary colorectal carcinomas and colon cancer cell lines [Paz et al. (2003) Hum. Mol. Genet. 12:2209-2210]. Similarly, SCGB3A1/HIN-1, a gene frequently methylated in multiple cancer types [Shigematsu et al. (2005) Int. J. Cancer 113:600-604; Krop et al. (2004) Mol. Cancer Res. 2:489-494; Krop et al. (2001) Proc. Natl. Acad. Sci. USA 98:9796-9801], was identified as one of the most highly significantly differentially present tags (Table 2).

TABLE 2 MSDK tags significantly (p < 0.050) differentially present inHCT116 WT and DKO MSDK libraries and genes associates with the MSDKtags. Position of Distance of Ratio AscI site in AscI site from MSDK TagSEQ ID NO. DKO WT DKO/WT P value Chr Gene Description relation to tr.Start tr. Start (bp) GTGCCGCCGCGGGCGCC 19 14 0 14 0.0023908 1 KIAA0478KIAA0478 gene product 5′ 308006 GTGCCGCCGCGGGCGCC 20 14 0 14 0.0023908 1WNT4 wingless-type MMTV integration site family 5′ 733 GCACAATGAAAGCATTT21 0 8 −9 0.0375409 1 TCEB3 elongin A 3′ 78 GCTGGACACAATGGGTC 22 0 15−17 0.0007148 1 MACF1 microfilament and actin filament cross-linker 3′35 TGTGAGGGCGAGTGTGA 23 9 0 9 0.020643 1 HIVEP3 human immunodeficiencyvirus type I enhancer 3′ 392630 AGCACCCGCCTGGAACC 24 2 15 −8 0.0024514 1PTPRF protein tyrosine phosphatase, receptor type, F 3′ 727GCTCACCTACCCAGGTG 25 12 0 12 0.0056628 1 Not Found GCCTCTCTGCGCCTGCC 2615 0 15 0.0015534 1 GFI1 growth factor independent 1 3′ 4842CCCGGACTTGGCCAGGC 27 47 2 21 2.35 × 10⁻⁸ 1 NHLH2 nescient helix loophelix 2 3′ 2971 TTCGGGCCGGGCCGGGA 28 18 0 18 0.0004261 1 LMX1A LIMhomeobox transcription factor 1, alpha 5′ 752 AGCCCTCGGGTGATGAG 29 14 014 0.0023908 1 LMX1A LIM homeobox transcription factor 1, alpha 5′ 752CTTATGTTTACAGCATC 30 4 16 −4 0.0103904 1 PAPPA2 pappalysin 2 isoform 25′ 255915 CTTATGTTTACAGCATC 31 4 16 −4 0.0103904 1 RFWD2 ring finger andWD repeat domain 2 isoform a 5′ 21 GTTCTCAAACAGCTTTC 32 2 10 −60.0365508 1 IPO9 importin 9 3′ 343 TCCAGGCAGGGCCTCTG 33 16 42 −30.000352 1 BTG2 B-cell translocation gene 2 3′ 431 CCCCCGCGACGCGGCGG 3428 0 28 5.72 × 10⁻⁶ 1 SOX13 SRY-box 13 5′ 571 CCCCCGCGACGCGGCGG 34 28 028 5.72 × 10⁻⁶ 1 FLJ40343 hypothetical protein FLJ40343 5′ 31281GTGAACTTCCAAGATGC 36 14 0 14 0.0023908 1 CNIH3 cornichon homolog 3 3′ 50ATGCGCCCCGCAGCCCC 37 8 0 8 0.0317702 1 MGC13186 hypothetical proteinMGC13186 5′ 321138 ATGCGCCCCGCAGCCCC 38 8 0 8 0.0317702 1 SIPA1L2signal-induced proliferation-associated 1 like 5′ 114742GTCCCCGCGCCGCGGCC 39 23 0 
23 4.94 × 10⁻⁵ 2 UBXD4 UBX domain containing 45′ 553390 GTCCCCGCGCCGCGGCC 40 23 0 23 4.94 × 10−5 2 APOB apolipoproteinB precursor 5′ 2343039 ATGCGAGGGGCGCGGTA 41 21 43 −2 0.0036483 2FLJ32954 hypothetical protein FLJ32954 5′ 277913 ATGCGAGGGGCGCGGTA 42 2143 −2 0.0036483 2 CDC42EP3 Cdc42 effector protein 3 5′ 366GCAGCATTGCGGCTCCG 43 36 0 36 1.82 × 10⁻⁷ 2 SIX2 sine oculis homeoboxhomolog 2 5′ 160394 TCATTGCATACTGAAGG 44 7 19 −3 0.0235641 2 SLC1A4solute carrier family 1, member 4 5′ 335302 TCATTGCATACTGAAGG 45 7 19 −30.0235641 2 SERTAD2 SERTA domain containing 2 5′ 245 GCGCTACACGCCGCTCC46 0 9 −10 0.0214975 2 SLC1A4 solute carrier family 1, member 4 5′ 111GCGCTACACGCCGCTCC 47 0 9 −10 0.0214975 2 SERTAD2 SERTA domain containing2 5′ 335436 CCCCAGCTCGGCGGCGG 48 53 0 53  1.19 × 10⁻¹⁰ 2 TCF7L1 HMG-boxtranscription factor TCF-3 3′ 859 CCTGGCCCTGTTGTGTC 49 8 0 8 0.0317702 2DUSP2 dual specificity phosphatase 2 5′ 26138 AAGCAGTCTTCGAGGGG 50 23 47−2 0.0022127 2 CNNM3 cyclin M3 isoform 1 5′ 396 GGAGGGCTGGAGTGAGG 51 120 12 0.020295 2 FLJ38377 hypothetical protein FLJ38377 3′ 593AGACCATCCTTGGACCC 52 15 0 15 0.0057312 2 B3GALT1 UDP-Gal:betaGlcNAc beta5′ 524869 GGCGCCAGAGGAAGATC 53 7 0 7 0.0488953 2 SSB autoantigen La 5′29950 CCCACCCGAGGGGAAGA 54 11 0 11 0.0087152 2 SP5 Sp5 transcriptionfactor 5′ 1824 TTAATCTGCTTATGAAA 55 0 7 −8 0.0172683 2 SP3 Sp3transcription factor 3′ 1637 AAATTCCATAGACAACC 56 11 0 11 0.0087152 2HOXD4 homeo box D4 3′ 1141 GGTGACAGAGTGCGACT 57 8 0 8 0.0317702 2 NotFound CAGCCGACTCTCTGGCT 58 7 0 7 0.0488953 3 DTYMK deoxythymidylatekinase (thymidylate kinase) 5′ 2784474 GGAGGCAAACGGGAACC 59 13 0 130.0036794 3 IQSEC1 IQ motif and Sec7 domain 1 5′ 315433GCTCGCCGAGGAGGGGC 60 16 0 16 0.0010093 3 RBMS3 RNA binding motif, singlestranded interacting 5′ 706157 GCTCGCCGAGGAGGGGC 61 16 0 16 0.0010093 3AZI2 5-azacytidine induced 2 isoform a 5′ 226210 GATCGCTGGGGTTTTGG 62 220 22 7.60 × 10⁻⁵ 3 DLEC1 deleted in lung and esophageal cancer 1 isoform5′ 9380 GATCGCTGGGGTTTTGG 63 22 0 
22 7.60 × 10⁻⁵ 3 PLCD1 phospholipaseC, delta 1 5′ 200 CTAATCTCTCCATCTGA 64 0 8 −9 0.0375409 3 SS18L2synovial sarcoma translocation gene on 5′ 8746 CTAATCTCTCCATCTGA 65 0 8−9 0.0375409 3 SEC22L3 vesicle trafficking protein isoform b 5′ 129CGGCGCGTCCCTGCCGG 66 51 0 51  2.82 × 10⁻¹⁰ 3 DKFZp313N0621 hypotheticalprotein DKFZp313N0621 5′ 339665 AACCCCGAAACTGGAAG 67 7 0 7 0.0488953 3FAM19A4 family with sequence similarity 19 (chemokine 5′ 143GAAGAGTCCCAGCCGGT 68 15 40 −3 0.0004426 3 MDS010 x 010 protein 5′ 5211GAAGAGTCCCAGCCGGT 69 15 40 −3 0.0004426 3 TMEM39A transmembrane protein39A 5′ 116 GAGGAGAGAGATGGTCC 70 8 0 8 0.0317702 3 GPR156 Gprotein-coupled receptor 156 5′ 41213 CCTGCCTCTGGCAGGGG 71 18 32 −20.042895 3 PLXNA1 plexin A1 5′ 5386 GCCTAGAAGAAGCCGAA 72 25 46 −20.0076042 3 RAB43 RAB41 protein 5′ 577 GGGCCGAGTCCGGCAGC 73 17 0 170.0006558 3 CHST2 carbohydrate (N-acetylglucosamine-6-O) 3′ 61CGTGTGAGCTCTCCTGC 74 28 47 −2 0.0176231 3 EPHB3 ephrin receptor EphB3precursor 3′ 576 CACTTCCCAGCTCTGAG 75 6 17 −3 0.0294258 4 FGFR3fibroblast growth factor receptor 3 isoform 1 5′ 26779 CACATCCCAGCCCGGGG76 16 0 16 0.0037515 4 FLJ33718 hypothetical protein FLJ33718 3′ 30337CCTGCGCCGGGGGAGGC 77 40 57 −2 0.0483974 4 ADRA2C alpha-2C-adrenergicreceptor 3′ 432 TACAATGAAGGGGTCAG 78 13 0 13 0.0036794 4 STK32Bserine/threonine kinase 32B 5′ 28 TACAATGAAGGGGTCAG 79 13 0 13 0.00367944 CYTL1 cytokine-like 1 5′ 32301 TTGGTAAGCATTATCTC 80 0 7 −8 0.0172683 4WFS1 wolframin 3′ 400 GTCCGTGGAATAGAAGG 81 13 0 13 0.0036794 4 Not FoundTTTACATTTAATCTATG 82 0 6 −7 0.030837 4 HNRPDL heterogeneous nuclearribonucleoprotein D-like 3′ 741 TGCGGAGAAGACCCGGG 83 3 13 −5 0.0196518 4ELOVL6 ELOVL family member 6, elongation of long 3′ 1583 chainGGAGGTCTCAGGATCCC 84 10 23 −3 0.0264674 5 FLJ20152 hypothetical proteinFLJ20152 5′ 108193 AAAGCGATCCAAACACA 85 7 0 7 0.0488953 5 BASP1 brainabundant, membrane attached signal 3′ 182 protein ACCCGGGCCGCAGCGGC 8638 2 17 1.10 × 10⁻⁶ 5 EFNA5 ephrin-A5 3′ 1019 CTGGGTTGCGATTAGCT 87 
15 015 0.0015534 5 PPIC peptidylprolyl isomerase C 5′ 62181ACACATTTATTTTTCAG 88 24 50 −2 0.0011958 5 KIAA1961 KIAA1961 proteinisoform 1 3′ 146 GTGGGAGTCAAAGAGCT 89 26 49 −2 0.0042447 5 APXL2 apicalprotein 2 5′ 4006 TCGCCGGGCGCTTGCCC 90 48 0 48 1.03 × 10⁻⁹ 5 PITX1paired-like homeodomain transcription factor 1 3′ 6163 CTGACCGCGCTCGCCCC91 10 0 10 0.013413 5 PACAP proapoptotic caspase adaptor protein 5′ 4496CGTCTCCCATCCCGGGC 92 7 0 7 0.0488953 5 CPLX2 complexin 2 3′ 1498TGCCACCCGGAGTCGCA 93 9 0 9 0.020643 5 Not Found CTGCCCTTATCCTCGGA 94 150 15 0.0015534 5 FLT4 fms-related tyrosine kinase 4 isoform 1 3′ 28178CGCTGACCACCAGGAGG 95 8 0 8 0.0317702 5 FLT4 fms-related tyrosine kinase4 isoform 1 5′ 24508 GCAGAAAAAGCACAAAG 96 11 0 11 0.0087152 5 FLT4fms-related tyrosine kinase 4 isoform 1 5′ 24508 GTCCTTGTTCCCATAGG 97 190 19 0.0002769 6 FOXC1 forkhead box C1 5′ 5056 TCAATGCTCCGGCGGGG 98 12 012 0.0056628 6 TFAP2A transcription factor Ap-2 alpha 5′ 4264GCAGCCGCTTCGGCGCC 99 2 14 −8 0.00425 6 EGFL9 EGF-like-domain, multiple 93′ 134 AGCTCTGAAGCCAGAAG 100 10 0 10 0.013413 6 VEGF vascularendothelial growth factor 5′ 52081 AGCTCTGAAGCCAGAAG 101 10 0 100.013413 6 MRPS18A mitochondrial ribosomal protein S18A 5′ 30336CCCTCCGATTCTACTAT 102 0 6 −7 0.030837 6 COL12A1 alpha 1 type XIIcollagen short isoform 3′ 394 AAGGAGACCGCACAGGG 103 13 0 13 0.0036794 6HTR1E 5-hydroxytryptamine (serotonin) receptor 1E 5′ 97AAGGAGACCGCACAGGG 104 13 0 13 0.0036794 6 SYNCRIP synaptotagmin binding,cytoplasmic RNA 5′ 1294285 ATTGTCAGATCTGGAAT 105 9 0 9 0.020643 6 MAP3K7mitogen-activated protein kinase kinase kinase 7 5′ 24225TGGTGATAACTGAACCC 106 15 29 −2 0.0333315 6 C6orf66 hormone-regulatedproliferation-associated 20 3′ 806 TCCATAGATTGACAAAG 107 27 0 27 8.80 ×10⁻⁶ 6 MARCKS myristoylated alanine-rich protein kinase C 3′ 3067TACAAGGCACTATGCTG 108 6 16 −3 0.0455421 6 MCMDC1 minichromosomemaintenance protein domain 3′ 518 GTTATGGCCAGAACTTG 109 19 2 8 0.00330396 MOXD1 monooxygenase, DBH-like 1 5′ 26536 
CAACCCACGGGCAGGTG 110 25 0 258.07 × 10⁻⁵ 6 TAGAP T-cell activation Rho GTPase-activating protein 5′123822 ATGAGTCCATTTCCTCG 111 8 0 8 0.0317702 7 MGC10911 hypotheticalprotein MGC10911 5′ 96664 ACCTGGAATAAACCCTG 112 0 7 −8 0.0172683 7 RAM2transcription factor RAM2 3′ 259 TATTTGCCAAGTTGTAC 113 6 17 −3 0.02942587 HOXA11 homeobox protein A11 3′ 622 ACAAAAATGATCGTTCT 114 10 24 −30.0177309 7 PLEKHA8 pleckstrin homology domain containing, family A 3′159 GGCTCTCCGTCTCTGCC 115 10 0 10 0.013413 7 CRHR2 corticotropinreleasing hormone receptor 2 3′ 521 GTCCCCAGCACGCGGTC 116 13 0 130.0036794 7 TBX20 T-box transcription factor TBX20 5′ 607CCTTGACTGCCTCCATC 117 11 0 11 0.0087152 7 WBSCR17 Williams Beurensyndrome chromosome region 5′ 512 17 TCTGAGTCGCCAGCGTC 118 4 18 −50.0037714 7 AASS aminoadipate-semialdehyde synthase 5′ 171064GGGGCCTATTCACAGCC 119 23 49 −2 0.0010583 8 TNKS tankyrase,TRF1-interacting ankyrin-related 5′ 404285 GGGGCCTATTCACAGCC 120 23 49−2 0.0010583 8 PPP1R3B protein phosphatase 1, regulatory (inhibitor) 5′953 CCAGACGCCGGCTCGGC 121 5 15 −3 0.036438 8 ZDHHC2 rec 3′ 683GTGACGATGGAGGAGCT 122 28 54 −2 0.001831 8 DUSP4 dual specificityphosphatase 4 isoform 1 3′ 629 CTCCTCCTTCTTTTGCG 123 3 12 −4 0.0325442 8ADAM9 a disintegrin and metalloproteinase domain 9 3′ 542GCGGGGGCAGCAGACGC 124 20 0 20 0.0001799 8 PRDM14 PR domain containing 143′ 768 TAACTGTCCTTTCCGTA 125 21 0 21 0.0001169 8 Not FoundAAGAGGCAGAACGTGCG 126 37 0 37 1.18 × 10⁻⁷ 8 KCNK9 potassium channel,subfamily K, member 9 3′ 360 CTTGCCTCTCATCCTTC 127 24 53 −2 0.0003864 8Sharpin shank-interacting protein-like 1 3′ 328 AAATGAAACTAGTCTTG 128 211 −6 0.0215511 9 ANKRD15 ankyrin repeat domain protein 15 5′ 171831TCTGTGTGCTGTGTGCG 129 3 14 −5 0.011762 9 SMARCA2 SWI/SNF-relatedmatrix-associated 3′ 1580 TAAATAGGCGAGAGGAG 130 13 57 −5 2.87 × 10⁻⁸ 9FLJ46321 FLJ46321 protein 5′ 299849 TAAATAGGCGAGAGGAG 131 13 57 −5 2.87× 10⁻⁸ 9 TLE1 transducin-like enhancer protein 1 5′ 241GCGGGCGGCGCGGTCCC 132 35 0 35 2.79 × 10⁻⁷ 9 LHX6 LIM 
homeobox protein 6isoform 1 3′ 408 AGGCAGGAGATGGTCTG 133 13 0 13 0.0133334 9 PRDM12 PRdomain containing 12 5′ 5017 GGCGTTAATAGAGAGGC 134 7 0 7 0.0488953 9PRDM12 PR domain containing 12 5′ 5017 AGGTTGTTGTTCTTGCA 135 19 0 190.0002769 9 PRDM12 PR domain containing 12 3′ 1427 AAGGAGCCTACGTTAAT 1363 12 −4 0.0325442 9 UBADC1 ubiquitin associated domain containing 1 3′10 GATAAGAAGGATGAGGA 137 18 0 18 0.0004261 9 BTBD14A BTB (POZ) domaincontaining 14A 5′ 98790 GCCTTCGACCCCCAGGC 138 9 0 9 0.020643 9 BTBD14ABTB (POZ) domain containing 14A 5′ 98790 CAGCCAGCTTTCTGCCC 139 38 0 387.67 × 10⁻⁸ 9 LHX3 LIM homeobox protein 3 isoform b 5′ 146TCCGCCTGTGACTCAAG 140 11 0 11 0.0087152 9 CLIC3 chloride intracellularchannel 3 3′ 1683 GTCCTGCTCCTCAAGGG 141 28 0 28 5.72 × 10⁻⁶ 9 CLIC3chloride intracellular channel 3 3′ 1683 GGGGAAGCTTCGAGCGC 142 5 16 −40.0229995 9 Not Found AAAATAGAGGTTCCTCC 143 10 25 −3 0.0117571 10 PRPF18PRP18 pre-mRNA processing factor 18 5′ 58621 homolog AAAATAGAGGTTCCTCC144 10 25 −3 0.0117571 10 C10orf30 chromosome 10 open reading frame 305′ 25417 AATGAACGACCAGACCC 145 20 37 −2 0.0188826 10 DDX21 DEAD(Asp-Glu-Ala-Asp) box polypeptide 21 3′ 506 AGTTAGTTCCCAACTCA 146 2 10−6 0.0365508 10 MLR2 ligand-dependent corepressor 5′ 84AGTTAGTTCCCAACTCA 147 2 10 −6 0.0365508 10 PIK3AP1phosphoinositide-3-kinase adaptor protein 1 5′ 112373 TGGATTTGGGTTTTCAG148 10 0 10 0.013413 10 HPSE2 heparanase 2 3′ 2954 GGGACAGGTGGCAGGCC 14933 0 33 6.62 × 10⁻⁶ 10 PAX2 paired box protein 2 isoform b 5′ 6126GAGCTAATCAATAGGCA 150 7 0 7 0.0488953 10 PAX2 paired box protein 2isoform b 5′ 6126 GTTTCCTTATTAATAGA 151 4 24 −7 0.0001591 10 TRIM8tripartite motif-containing 8 5′ 375 CCCCGTGGCGGGAGCGG 152 26 0 26 5.26× 10⁻⁵ 10 NEURL neuralized-like 5′ 630 CCCCGTGGCGGGAGCGG 153 26 0 265.26 × 10−5 10 FAM26A family with sequence similarity 26, member A 5′14420 GAGGTAGTGCCCTGTCC 154 13 0 13 0.0036794 10 SH3MD1 SH3 multipledomains 1 3′ 24 TTGTGTGTACATAGGCC 155 8 0 8 0.0317702 10 SORCS1 SORCSreceptor 1 isoform a 5′ 
1301646 GCAGGACGGCGGGGCCA 156 8 0 8 0.0317702 10LHPP phospholysine phosphohistidine inorganic 5′ 14183 GCAGGACGGCGGGGCCA157 8 0 8 0.0317702 10 OAT ornithine aminotransferase precursor 5′ 28768GGGCCCCGCCCAGCCAG 158 11 0 11 0.0087152 10 C10orf137 erythroiddifferentiation-related factor 1 5′ 556810 GGGCCCCGCCCAGCCAG 159 11 0 110.0087152 10 CTBP2 C-terminal binding protein 2 isoform 1 5′ 2249CCTGGAAGGAATTTAGG 160 8 0 8 0.0317702 10 PTPRE protein tyrosinephosphatase, receptor type, E 3′ 408 GGAGTTCCATCTCCGAG 161 13 0 130.0036794 10 MGMT O-6-methylguanine-DNA methyltransferase 5′ 1317729GGAGTTCCATCTCCGAG 162 13 0 13 0.0036794 10 MKI67 antigen identified bymonoclonal antibody Ki- 5′ 23268 67 GAAAACTCCAGATAGTG 163 17 0 170.0006558 11 ASCL2 achaete-scute complex homolog-like 2 3′ 582CTTTGAAATAAGCGAAT 164 3 13 −5 0.0196518 11 PDE3B phosphodiesterase 3B,cGMP-inhibited 3′ 526 GGCAGGAGGATGCGGGG 165 5 15 −3 0.036438 11 FJX1four jointed box 1 3′ 725 TCTAGGACCTCCAGGCC 166 14 32 −3 0.0066996 11SLC39A13 solute carrier family 39 (zinc transporter) 5′ 415TCTAGGACCTCCAGGCC 167 14 32 −3 0.0066996 11 SPI1 spleen focus formingvirus (SFFV) proviral 5′ 29668 CCCTGCCCTTAGTGCTT 168 7 0 7 0.0488953 11Not Found GCCAACCTGAAGACCCC 169 7 0 7 0.0488953 11 SSSCA1 Sjogren'ssyndrome/scleroderma autoantigen 1 5′ 12479 GCCAACCTGAAGACCCC 170 7 0 70.0488953 11 LTBP3 latent transforming growth factor beta binding 5′ 33GCCCCCTAGGCCCTTTG 171 10 0 10 0.013413 11 FGF19 fibroblast growth factor19 precursor 5′ 44445 CTGCAAAATCTGCTCCT 172 5 16 −4 0.0229995 11 NotFound GCTCGACCCAGCTGGGA 173 7 0 7 0.0488953 11 ROBO3 roundabout, axonguidance receptor, homolog 3 5′ 534 GCTCGACCCAGCTGGGA 174 7 0 70.0488953 11 FLJ23342 hypothetical protein FLJ23342 5′ 64448GATTATGAAAGCCCATC 175 14 0 14 0.0023908 11 BARX2 BarH-like homeobox 2 5′2434 GATTATGAAAGCCCATC 176 14 0 14 0.0023908 11 RICS RhoGTPase-activating protein 5′ 349388 GAACAAACCCAGGGATC 177 9 0 9 0.02064312 KCNA1 potassium voltage-gated channel, shaker-related 5′ 
1403
TGTGTTCAGAGGGCGGA  178  7  0  7  0.0488953  12  GPR92  putative G protein-coupled receptor 92  3′  15529
CCTGCCGGTGGAGGGCA  179  13  0  13  0.0036794  12  ST8SIA1  ST8 alpha-N-acetyl-neuraminide  5′  176
GCTGCCCCAAGTGGTCT  180  11  0  11  0.0087152  12  Not Found
AGAACGGGAACCGTCCA  181  19  0  19  0.0002769  12  CENTG1  centaurin, gamma 1  3′  3647
TCTCCGTGTATGTGCGC  182  6  20  −4  0.0074301  12  HMGA2  high mobility group AT-hook 2  3′  1476
TTTCAGCGGGAGCCGCC  183  10  0  10  0.013413  12  KIAA1853  KIAA1853 protein  5′  64
GAGGCCAGATTTTCTCC  184  40  64  −2  0.007793  12  HIP1R  huntingtin interacting protein-1-related  5′  170
AAGGCTGGGAGTTTTCT  185  23  38  −2  0.0434041  12  ABCB9  ATP-binding cassette, sub-family B (MDR/TAP),  3′  517
CGAACTTCCCGGTTCCG  186  18  0  18  0.0004261  12  Not Found
CAGCGGCCAAAGCTGCC  187  16  31  −2  0.0259626  12  RAN  ras-related nuclear protein  5′  257
CAGCGGCCAAAGCTGCC  188  16  31  −2  0.0259626  12  EPIM  epimorphin isoform 2  5′  32499
CACTGCCTGATGGTGTG  189  23  0  23  0.0001899  13  IL17D  interleukin 17D precursor  3′  277
CCACCAGCCTCCCTCGG  190  19  36  −2  0.0173058  13  DOCK9  dedicator of cytokinesis 9  5′  1277
AGCTCTGCCAGTAGTTG  191  10  26  −3  0.0077231  14  MTHFD1  methylenetetrahydrofolate dehydrogenase 1  5′  49925
AGCTCTGCCAGTAGTTG  192  10  26  −3  0.0077231  14  ESR2  estrogen receptor 2  5′  44089
CCTCTAGGACCAAGCCT  193  12  0  12  0.0056628  14  SLC8A3  solute carrier family 8 member 3 isoform B  3′  270
CTACCTAAGGAGAGCAG  194  2  13  −7  0.0073393  14  MED6  mediator of RNA polymerase II transcription,  5′  41006
GAGTCGCAGTATTTTGG  195  12  25  −2  0.0345796  14  GTF2A1  TFIIA alpha, p55 isoform 1  3′  181
CGGCGCAGCTCCAGGTC  196  13  0  13  0.0036794  14  KCNK10  potassium channel, subfamily K, member 10  3′  3468
GGCCGGTGCCGCCAGTC  197  10  0  10  0.013413  14  EML1  echinoderm microtubule associated protein like 1  5′  62907
GGGACCCGGAAAGGTGG  198  13  0  13  0.0036794  14  KIAA1446  brain-enriched guanylate kinase-associated  3′  1674
GCTCTGCCCCCGTGGCC  199  9  23  −3  0.0148748  15  BAHD1  bromo adjacent homology domain containing 1  5′  138
AGAGCTGAGTCTCACCC  200  8  20  −3  0.0285917  15  CDAN1  codanin 1  3′  359
TCAGGCTTCCCCTTCGG  201  4  13  −4  0.0445448  15  PIAS1  protein inhibitor of activated STAT, 1  5′  190450
CCTGTGGACAGGATACC  202  8  0  8  0.0317702  15  LRRN6A  leucine-rich repeat neuronal 6A  5′  140491
TGGGGACTGATGCACCC  203  0  12  −13  0.0009509  15  CIB2  DNA-dependent protein kinase catalytic  3′  598
GCAGTAAACCGTGACTT  204  7  0  7  0.0488953  15  ADAMTSL3  ADAMTS-like 3  5′  114
CGCACTCACACGGACGA  205  7  0  7  0.0488953  16  ZNF206  zinc finger protein 206  3′  3376
ATCCGGCCAAGCCCTAG  206  10  0  10  0.013413  16  ATF7IP2  activating transcription factor 7 interacting  5′  244550
ATCCGGCCAAGCCCTAG  207  10  0  10  0.013413  16  GRIN2A  N-methyl-D-aspartate receptor subunit 2A  5′  809
CGATTCGAAGGGAGGGG  208  27  0  27  3.43 × 10⁻⁵  16  IRX6  iroquois homeobox protein 6  5′  386305
CCTAACAAGATTGCATA  209  14  32  −3  0.0066996  16  DDX19  DEAD (Asp-Glu-Ala-Asp) box polypeptide 19  5′  23
CCTAACAAGATTGCATA  210  14  32  −3  0.0066996  16  AARS  alanyl-tRNA synthetase  5′  9662
TCCCGCGCCCAGGCCCC  211  11  0  11  0.0087152  16  ZCCHC14  zinc finger, CCHC domain containing 14  3′  143
GCAACAGCCTCCGGAGG  212  0  8  −9  0.0375409  16  TUBB3  tubulin, beta, 4  3′  843
CACAGCCAGCCTCCCAG  213  36  0  36  1.82 × 10⁻⁷  17  LHX1  LIM homeobox protein 1  3′  3701
CCTACCTATCCCTGGAC  214  14  0  14  0.0023908  17  STAT5A  signal transducer and activator of transcription  3′  1085
GCTATGGGTCGGGGGAG  215  42  0  42  1.37 × 10⁻⁸  17  SOST  sclerostin precursor  3′  3140
GATGCTCGAACGCAGAG  216  7  0  7  0.0488953  17  SOST  sclerostin precursor  3′  3140
GTGAAATTCCCGTCTCT  217  23  0  23  4.94 × 10⁻⁵  17  Not Found
GAGGCTGGCACCCAGGC  218  13  0  13  0.0036794  17  C1QL1  complement component 1, q subcomponent-like 1  3′  8471
CCCCCAGAGTGACTAAG  219  10  0  10  0.013413  17  ProSAPiP2  ProSAPiP2 protein  3′  13991
TTGAGAACTGCCCCCCT  220  3  12  −4  0.0325442  17  HOXB9  homeo box B9  3′  455
CCCCGTTTTTGTGAGTG  221  11  23  −2  0.0443851  17  HOXB9  homeo box B9  5′  20620
GGGCGGTGGCAAGGGGC  222  9  0  9  0.020643  17  NXPH3  neurexophilin 3  3′  20
CTTAGCCCACAGAGAAC  223  18  0  18  0.0004261  17  FLJ20920  hypothetical protein FLJ20920  3′  43255
CATTTCCTGGGCTATTT  224  10  0  10  0.013413  17  MRC2  mannose receptor, C type 2  3′  527
GTGACCAGCCTGGAGAG  225  15  0  15  0.0015534  17  SDK2  sidekick 2  5′  206723
CCCCTGCCCTGTCACCC  226  30  0  30  2.41 × 10⁻⁶  17  SLC9A3R1  solute carrier family 9 (sodium/hydrogen)  3′  11941
CTGAATGGGGCAAGGAG  227  48  0  48  1.03 × 10⁻⁹  17  ENPP7  ectonucleotide pyrophosphatase/phosphodiesterase  5′  628261
CCTCTTCCCAGACCGAA  228  13  0  13  0.0036794  17  CBX4  chromobox homolog 4  5′  1307
ACCCGCACCATCCCGGG  229  91  0  91  3.74 × 10⁻¹⁷  17  CBX4  chromobox homolog 4  5′  4600
GCTGCGGGCACCGGGCG  230  25  0  25  2.08 × 10⁻⁵  17  raptor  raptor  5′  66979
GCTGCGGGCACCGGGCG  231  25  0  25  2.08 × 10⁻⁵  17  NPTX1  neuronal pentraxin I precursor  5′  1684
CCTCGGTGAGTGTCTCG  232  4  22  −6  0.0004645  17  P4HB  prolyl 4-hydroxylase, beta subunit  5′  67
TCCCTCATTCGCCCCGG  233  43  18  2  0.0314243  18  EMILIN2  elastin microfibril interfacer 2  3′  143
GAAAAGTTGAACTCCTG  234  12  0  12  0.0056628  18  C18orf1  chromosome 18 open reading frame 1 isoform alpha  3′  20803
GTGGAGGGGAGGTACTG  235  8  0  8  0.0317702  18  IER3IP1  immediate early response 3 interacting protein  5′  70905
TGAAGAAAAGGCCTTTG  236  9  0  9  0.020643  18  ACAA2  acetyl-coenzyme A acyltransferase 2  5′  380776
GCCCGCGGGGCTGTCCC  237  9  0  9  0.020643  18  GALR1  galanin receptor 1  5′  146
GCCCGCGGGGCTGTCCC  238  9  0  9  0.020643  18  MBP  myelin basic protein  5′  232612
TCCTGTCTCATCTGCGA  239  9  0  9  0.020643  18  SALL3  sal-like 3  5′  463
TCTCGGCGCAAGCAGGC  240  12  0  12  0.0056628  18  SALL3  sal-like 3  3′  1008
TCCGGAGTTGGGACCTC  241  14  0  14  0.0087469  19  Not Found
GCAAACATCAGGACCAC  242  9  0  9  0.020643  19  KIAA0963  KIAA0963  3′  51678
AACGGGATCCGCACGGG  243  8  0  8  0.0317702  19  APC2  adenomatosis polyposis coli 2  3′  18214
GCCTTCCTGTCCCCCAA  244  0  8  −9  0.0096701  19  KLF16  BTE-binding protein 4  3′  2472
GTGCCAGGAAGCAAGTC  245  10  22  −2  0.0390686  19  AP3D1  adaptor-related protein complex 3, delta 1  3′  328
AGCCTGCAAAGGGGAGG  246  17  34  −2  0.0142228  19  AKAP8L  A kinase (PRKA) anchor protein 8-like  5′  13794
GGGTAGAACCTGGGGGA  247  28  0  28  2.23 × 10⁻⁵  19  GTPBP3  GTP binding protein 3 (mitochondrial) isoform  3′  2019
CCCGCTCCTTCGGTTCG  248  5  16  −4  0.0229995  19  ITPKC  inositol 1,4,5-trisphosphate 3-kinase C  5′  273
CCCGCTCCTTCGGTTCG  249  5  16  −4  0.0229995  19  ADCK4  aarF domain containing kinase 4  5′  134
CGTGGGAAACCTCGATG  250  15  31  −2  0.0163452  19  ASE-1  CD3-epsilon-associated protein; antisense to  5′  1320
CGTGGGAAACCTCGATG  251  15  31  −2  0.0163452  19  PPP1R13L  protein phosphatase 1, regulatory (inhibitor)  5′  11721
AGACTAAACCCCCGAGG  252  18  44  −3  0.0005081  19  ASE-1  CD3-epsilon-associated protein; antisense to  3′  824
CTAGAAGGGGTCGGGGA  253  16  0  16  0.0010093  19  CALM3  calmodulin 3  5′  129594
CTAGAAGGGGTCGGGGA  254  16  0  16  0.0010093  19  FLJ10781  hypothetical protein FLJ10781  5′  140
TACAGCTGCTGCAGCGC  255  7  0  7  0.0488953  19  GRIN2D  N-methyl-D-aspartate receptor subunit 2D  3′  48538
GTTTATTCCAAACACTG  256  7  0  7  0.0488953  19  GRIN2D  N-methyl-D-aspartate receptor subunit 2D  3′  48538
CGGGGTTTCTATGGTAA  257  7  19  −3  0.0235641  19  MYADM  myeloid-associated differentiation marker  3′  986
CCCAACCAATCTCTACC  258  13  0  13  0.0036794  19  ZNF274  zinc finger protein 274 isoform b  3′  323
CGTAGGGCCGTTCACCC  259  7  0  7  0.0488953  19  ZNF42  zinc finger protein 42 isoform 1  3′  10788
CTCACGACGCCGTGAAG  260  40  67  −2  0.0032581  20  SOX12  SRY (sex determining region Y)-box 12  3′  123
TCAGCCCAGCGGTATCC  261  0  9  −10  0.0214975  20  RRBP1  ribosome binding protein 1  3′  270
GTTTACCCTCTGTCTCC  262  19  0  19  0.0002769  20  RIN2  RAB5 interacting protein 2  5′  130452
GGGTGCGGAACCCGGCC  263  16  0  16  0.0010093  20  Not Found
CCAGCTTTAGAGTCAGA  264  40  0  40  1.29 × 10⁻⁷  20  Not Found
GGGAATAGGGGGGCGGG  265  14  0  14  0.0087469  20  CDH22  cadherin 22 precursor  5′  56203
ACCCTGAAAGCCTAGCC  266  24  0  24  3.21 × 10⁻⁵  21  ITGB2  integrin beta chain, beta 2 precursor  5′  10805
TTCCAAAAAGGGGCAGG  267  3  16  −6  0.0041258  22  XBP1  X-box binding protein 1  5′  82906
CCCACCAGGCACGTGGC  268  21  40  −2  0.0105097  22  NPTXR  neuronal pentraxin receptor isoform 1  5′  376
GCCTCAGCATCCTCCTC  269  18  0  18  0.0004261  22  FLJ27365  FLJ27365 protein  5′  24574
GCCTCAGCATCCTCCTC  270  18  0  18  0.0004261  22  FLJ10945  hypothetical protein FLJ10945  5′  7284
GCCCTGGGGTGTTATGG  271  8  22  −3  0.012181  22  FLJ27365  FLJ27365 protein  5′  13829
GCCCTGGGGTGTTATGG  272  8  22  −3  0.012181  22  FLJ10945  hypothetical protein FLJ10945  5′  18029
GGCAGGAAGACGGTGGA  273  10  22  −2  0.0390686  22  ACR  acrosin precursor  5′  63440
GGCAGGAAGACGGTGGA  274  10  22  −2  0.0390686  22  ARSA  arylsulfatase A precursor  5′  46630
GGGGCGAAGAAAGCAGA  275  8  28  −4  0.0007679  23  STAG2  stromal antigen 2  5′  1402
GAAGCAAGAGTTTGGCC  276  19  34  −2  0.0335364  23  FLNA  filamin 1 (actin-binding protein-280)  3′  3103

DKO and WT, raw abundance (total numbers) of the indicated MSDK tag observed in the DKO and WT libraries. Ratio DKO/WT, ratio of normalized abundances (total numbers) of the indicated tag in the DKO and WT libraries (a minus sign indicates that the indicated number is the reciprocal of the DKO/WT ratio). P value, the significance of the difference in the raw abundances of the relevant MSDK tag between the two libraries. Chr, chromosome in which the MSDK tag sequence is located. Gene, gene with which the indicated MSDK tag was associated. Description, description of the product of the associated gene. The position of the AscI site (recognition sequence) identified by the indicated tag relative to the transcription initiation site (tr. Start) of the gene and the distance of the AscI site (recognition sequence) from the transcription initiation site are indicated.
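The sign convention in the legend above (a minus sign marks the reciprocal of the ratio) can be illustrated with a short Python sketch. The function name `signed_fold_ratio`, the scaling of both libraries to the larger library total, and the replacement of normalized values below 1 by 1 are assumptions made for illustration; the legend does not state how abundances were normalized or how zero counts were handled. The library totals in the usage note are hypothetical.

```python
def signed_fold_ratio(count_a, count_b, total_a, total_b, floor=1.0):
    """Signed fold ratio of library A over library B.

    Assumptions (not stated in the source): counts are scaled to the
    larger of the two library totals, and normalized values below
    `floor` are raised to `floor` so that a zero count never causes
    a division by zero.
    """
    common = max(total_a, total_b)
    norm_a = max(count_a * common / total_a, floor)
    norm_b = max(count_b * common / total_b, floor)
    if norm_a >= norm_b:
        # Tag is at least as abundant in A: report A/B as a positive number.
        return round(norm_a / norm_b)
    # Tag is more abundant in B: report the reciprocal with a minus sign.
    return -round(norm_b / norm_a)
```

With hypothetical library totals of 30000 (A) and 25000 (B), a tag counted 6 times in A and 20 times in B gives −4, and a tag counted 7 times in A and never in B gives 7.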

In order to further validate the MSDK technique, three highly differentially present tags were selected from the HCT libraries, the corresponding genomic loci (corresponding to the LHX3, LMX-1A, and TCF7L1 genes) were identified, and sequencing of bisulfite treated genomic DNA (the same as that used for the generation of the MSDK libraries) was performed. In all three cases, the relevant AscI site was completely methylated in the WT cells and unmethylated in the DKO cells (FIGS. 3-5). In addition, almost all other surrounding CpGs showed the same methylation/unmethylation pattern. FIGS. 6-8 show the nucleotide sequences of the regions of these three genes that were subjected to the described methylation-detecting sequencing analysis. These results indicated that the MSDK method is suitable for genome-wide analysis of methylation patterns and the identification of differentially methylated sites.
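The principle behind the bisulfite sequencing check can be sketched in a few lines of Python. Bisulfite treatment converts unmethylated cytosines so that they read as thymine, while methylated cytosines remain cytosine. The function names are hypothetical, only the top strand is modeled, and the example assumes methylation occurs only at CpG cytosines. The sequence used is the AscI recognition sequence, GGCGCGCC.

```python
def cpg_positions(seq):
    """Return the 0-based indices of cytosines that are followed by a G."""
    seq = seq.upper()
    return {i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"}


def bisulfite_convert(seq, methylated=frozenset()):
    """Simulate the bisulfite read of the top strand.

    Unmethylated C is read as T; a C whose index is in `methylated`
    is protected and stays C. Other bases are unchanged.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


asci_site = "GGCGCGCC"
cpgs = cpg_positions(asci_site)                  # CpG cytosines in the site
unmethylated_read = bisulfite_convert(asci_site)
methylated_read = bisulfite_convert(asci_site, methylated=cpgs)
```

In this sketch a fully unmethylated site reads as GGTGTGTT and a site methylated at both CpG cytosines reads as GGCGCGTT, so the two states are distinguishable by sequencing.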

Example 3 Analysis of MSDK Libraries from Cell Populations Isolated from Normal and Cancerous Breast Tissue

MSDK libraries were generated from epithelial cells, myoepithelial cells, and fibroblast-enriched stroma isolated from normal breast tissue, in situ (DCIS; ductal carcinoma in situ) breast carcinoma tissue, and invasive breast carcinoma tissue. A detailed description of the samples is given in Table 3.

TABLE 3 List of breast tissue samples used for methylation analyses.

Name        Organ   Histology  Cell type      Tumor name  Age  Histology                     Grade LN ER PR Her2
D-MYOEP-6   breast  tumor      myoepithelial  DCIS-6      29   pure extensive DCIS           high
D-EPI-6     breast  tumor      epithelial     DCIS-6      29   pure extensive DCIS           high
D-MYOEP-7   breast  tumor      myoepithelial  DCIS-7      29   ext. DCIS adjacent to IDC     intermediate pos low pos neg
N-EPI-I7    breast  normal     epithelial                 47   normal matched to tumor
I-EPI-7     breast  tumor      epithelial     IDC-7       47   invasive ductal carcinoma     low pos pos pos neg
N-STR-I7    breast  normal     stroma                     47   normal matched to tumor
I-STR-7     breast  normal     stroma         IDC-7       47   invasive ductal carcinoma     low pos pos pos neg
N-STR-I17   breast  normal     stroma                     44   normal matched to tumor
I-STR-17    breast  tumor      stroma         IDC-17      44   invasive ductal carcinoma     intermediate
N-MYOEP-4   breast  normal     myoepithelial              25   normal reduction
N-EPI-4     breast  normal     epithelial                 25   normal reduction
N-MYOEP-6   breast  normal     myoepithelial              19   normal reduction
N-MYOEP-3   breast  normal     myoepithelial              24   normal reduction
N-STR-7     breast  normal     stroma                     26   normal reduction
I-STR-11    breast  tumor      stroma         IDC-11      43   invasive ductal carcinoma     low pos pos pos neg
N-PBS-1     breast  normal     culture                    38   normal reduction
N-EPI-5     breast  normal     epithelial                 58   normal matched to tumor       high neg neg neg neg
I-EPI-9     breast  tumor      epithelial     IDC-9       45   invasive ductal carcinoma     intermediate pos pos neg
HCT-WT      colon   tumor      cell line
HCT-DKO     colon   tumor      cell line

Values in the last column are listed in the order in which they appear in the source row. The numbers at the ends of the tissue sample names indicate patients from which the tissue samples were obtained. Age is the age of the particular patient. LN indicates whether the carcinoma in the relevant patient had spread to one or more lymph nodes. ER indicates whether the relevant carcinoma cells expressed the estrogen receptor. PR indicates whether the relevant carcinoma cells expressed the progesterone receptor. Her2 indicates whether the relevant carcinoma cells expressed Her2/Neu. Grade is the histologic grade.

Whenever possible, normal and tumor tissue were derived from the same patient in order to control for possible epigenetic variations due to age and to reproductive and disease status. Fibroblast-enriched stroma consisted of the cells remaining after removal of epithelial cells, myoepithelial cells, leukocytes, and endothelial cells, and comprised over 80% fibroblasts. DNA samples were also analyzed with SNP arrays in order to rule out the possibility of overt DNA copy number alterations.

Pair-wise comparisons and statistical analyses of the MSDK libraries revealed that the largest fraction of highly (>10-fold difference) differentially present tags occurred between normal and tumor epithelial cells, and the majority of these tags were more abundant in tumor cells (Tables 4 and 5), correlating with the known overall hypomethylation of the cancer genome [Feinberg et al. (1983) Nature 301: 89-92].

TABLE 4 Chromosomal location and analysis of the frequency of MSDK tags in the I-EPI-7 and N-EPI-I7 MSDK libraries.

                               I-EPI-7          N-EPI-I7         Tag Variety Ratio  Tag Copy Ratio     Differential Tag (P < 0.05)
Chr         Virtual  Observed  Variety  Copies  Variety  Copies  I-EPI-7/N-EPI-I7   I-EPI-7/N-EPI-I7   I-EPI-7 > N-EPI-I7  N-EPI-I7/I-EPI-7
            Tags     Tags
1           551      273       265      3330    98       496     2.704              6.714              28                  5
2           473      192       183      1979    62       517     2.952              3.828              11                  4
3           349      153       142      1792    58       535     2.448              3.350              8                   2
4           281      122       118      1595    42       244     2.810              6.537              15                  0
5           334      136       126      1296    55       399     2.291              3.248              7                   3
6           338      130       120      994     50       245     2.400              4.057              1                   0
7           403      193       186      1757    61       340     3.049              5.168              7                   3
8           334      141       137      1327    51       300     2.686              4.423              6                   3
9           349      153       145      1370    60       405     2.417              3.383              3                   3
10          387      158       149      1599    59       378     2.525              4.230              7                   1
11          379      169       161      1434    69       327     2.333              4.385              6                   1
12          299      127       121      1060    49       331     2.469              3.202              5                   4
13          138      53        51       474     20       108     2.550              4.389              1                   1
14          228      96        91       838     28       165     3.250              5.079              5                   0
15          260      116       108      936     40       158     2.700              5.924              8                   0
16          340      145       137      1355    55       279     2.491              4.857              15                  3
17          400      196       191      1952    70       496     2.729              3.935              7                   4
18          181      72        69       527     19       125     3.632              4.216              1                   0
19          463      173       165      1711    83       388     1.988              4.410              8                   1
20          236      95        90       1009    38       244     2.368              4.135              4                   0
21          71       24        24       255     8        69      3.000              3.696              2                   0
22          217      88        85       781     31       205     2.742              3.810              3                   0
X           185      55        53       462     19       116     2.789              3.983              1                   0
Y           9
Matches     7205     3060      2917     29833   1125     6870    2.593              4.343              159                 38
No Matches           1510      820      6835    930      4463    0.882              1.531              13                  32
Total       7205     4570      3737     36668   2055     11333   1.818              3.236              172                 70

The column headings are as indicated for Table 1.
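As a worked check of how the ratio columns in Table 4 relate to the count columns, the following Python sketch recomputes the tag variety ratio and the tag copy ratio for two rows. The function name `tag_ratios` and the rounding to three decimals are illustrative assumptions; the input numbers are taken from the chromosome 1 and Total rows of the table.

```python
def tag_ratios(variety_tumor, copies_tumor, variety_normal, copies_normal):
    """Return (tag variety ratio, tag copy ratio) for I-EPI-7 over N-EPI-I7.

    Both ratios are plain quotients of the raw table entries, rounded
    to three decimals to match the precision printed in the table.
    """
    variety_ratio = round(variety_tumor / variety_normal, 3)
    copy_ratio = round(copies_tumor / copies_normal, 3)
    return variety_ratio, copy_ratio


# (I-EPI-7 variety, I-EPI-7 copies, N-EPI-I7 variety, N-EPI-I7 copies)
chromosome_1 = (265, 3330, 98, 496)
total = (3737, 36668, 2055, 11333)

chr1_ratios = tag_ratios(*chromosome_1)
total_ratios = tag_ratios(*total)
```

The recomputed values agree with the printed entries (2.704 and 6.714 for chromosome 1; 1.818 and 3.236 for the Total row), which suggests these two columns are unnormalized quotients of the variety and copy counts.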

TABLE 5 MSDK tags significantly (p < 0.050) differentially present in N-EPI-I7 and I-EPI-7 MSDK libraries and genes associated with the MSDK tags.

MSDK Tag  SEQ ID NO.  N-EPI-I7  I-EPI-7  Ratio I-EPI-7/N-EPI-I7  P value  Chr  Gene  Description  Position of AscI site in relation to tr. Start  Distance of AscI site from tr. Start (bp)
CAACGGAAACAAAAACA  277  4  0  −13  0.029464  1  MMP23A  matrix metalloproteinase 23A  5′  6922
CAACGGAAACAAAAACA  278  4  0  −13  0.029464  1  HSPC182  HSPC182 protein  5′  111089
CCCGCCACGCCGCCCCG  279  0  13  13  0.0158  1  ENO1  enolase 1  3′  230
CTCCAAAAATCCCTTGA  280  5  0  −16  0.046199  1  NBL1  neuroblastoma, suppression of tumorigenicity 1  5′  158583
CTCCAAAAATCCCTTGA  281  5  0  −16  0.046199  1  CAPZB  F-actin capping protein beta subunit  5′  64897
GTGCCGCCGCGGGCGCC  282  11  61  2  0.032251  1  KIAA0478  KIAA0478 gene product  5′  308006
GTGCCGCCGCGGGCGCC  283  11  61  2  0.032251  1  WNT4  wingless-type MMTV integration site family  5′  733
CTGCAACTTGGTGCCCC  284  2  22  3  0.027586  1  PRDX1  peroxiredoxin 1  3′  150
GCCTCTCTGCGCCTGCC  285  18  10  −6  0.023961  1  GFI1  growth factor independent 1  3′  4842
CTCCGTTTTCTTTTGTT  286  4  0  −13  0.029464  1  ALX3  aristaless-like homeobox 3  3′  1631
AGCGCTTGGCGCTCCCA  287  5  54  3  0.002039  1  NPR1  natriuretic peptide receptor A/guanylate cyclase  3′  677
TCTGGGGCCGGGTAGCC  288  9  216  7  7.35 × 10⁻¹⁶  1  P66beta  transcription repressor p66 beta component of  5′  117605
CACCCGCGGGGGTGGGG  289  0  17  17  0.028576  1  IL6R  interleukin 6 receptor isoform 2 precursor  3′  898
CGTGTGTATCTGGGGGT  290  6  51  3  0.007702  1  MUC1  mucin 1, transmembrane  3′  188528
GCAGCGGCGCTCCGGGC  291  9  120  4  1.75 × 10⁻⁷  1  MUC1  mucin 1, transmembrane  3′  139119
TGTTCAGAGCCAGCTTG  292  2  25  4  0.01729  1  LMNA  lamin A/C isoform 2  3′  236
CCAGGCTGGCTCACCCT  293  0  27  27  0.003867  1  HAPLN2  brain link protein-1  3′  4728
CCAGGGCCTGGCACTGC  294  15  89  2  0.003766  1  IGSF9  immunoglobulin superfamily, member 9  5′  393
TTCGGGCCGGGCCGGGA  295  17  90  2  0.009369  1  LMX1A  LIM homeobox transcription factor 1, alpha  5′  752
AGCCCTCGGGTGATGAG  29  7  83  4  4.14 × 10⁻⁵  1  LMX1A  LIM homeobox transcription factor 1, alpha  5′  752
CATTCCAGTTACAGTTG  297  5  40  2  0.027143  1  GPR161  G protein-coupled receptor 161  3′  198
TCCACAGCGGACGTTCC  298  0  32  32  0.004049  1  TOR3A  torsin family 3, member A  3′  100
ACATTGTCCTTTTTGCC  299  2  25  4  0.01729  1  C1orf24  niban protein  3′  292
CCGAGGGGCCTGGCGCC  300  0  12  12  0.026152  1  BTG2  B-cell translocation gene 2  3′  431
TCCAGGCAGGGCCTCTG  301  8  91  4  2.06 × 10⁻⁵  1  BTG2  B-cell translocation gene 2  3′  431
CCCCCGCGACGCGGCGG  34  10  4  −8  0.039911  1  SOX13  SRY-box 13  5′  571
CCCCCGCGACGCGGCGG  34  10  4  −8  0.039911  1  FLJ40343  hypothetical protein FLJ40343  5′  31281
TGGATTTGGTCGTCTCC  304  0  25  25  0.005775  1  PLXNA2  plexin A2  3′  428
GCCCCCGTGGCGCCCCG  305  8  97  4  6.47 × 10⁻⁶  1  CENPF  centromere protein F (350/400 kD)  5′  51300
GCCCCCGTGGCGCCCCG  306  8  97  4  6.47 × 10⁻⁶  1  PTPN14  protein tyrosine phosphatase, non-receptor type  5′  589
TCGGTGGTCGCTCGTGG  307  0  19  19  0.019333  1  MGC42493  hypothetical protein MGC42493  5′  244931
TCGGTGGTCGCTCGTGG  308  0  19  19  0.019333  1  CDC42BPA  CDC42-binding protein kinase alpha isoform A  5′  486
GCTAGGGAAAAACAGGC  309  11  59  2  0.043511  1  MGC42493  hypothetical protein MGC42493  5′  244931
GCTAGGGAAAAACAGGC  310  11  59  2  0.043511  1  CDC42BPA  CDC42-binding protein kinase alpha isoform A  5′  486
GACGCGCTCCCGCGGGC  311  5  42  3  0.01897  1  WNT3A  wingless-type MMTV integration site family  5′  59111
GACGCGCTCCCGCGGGC  312  5  42  3  0.01897  1  WNT9A  wingless-type MMTV integration site family  5′  41
CAAAGGAGCTGTGGAGC  313  2  23  4  0.026376  1  TAF5L  PCAF associated factor 65 beta  3′  192
GAGCGGCCGCCCAGAGC  314  6  61  3  0.001212  1  TAF5L  PCAF associated factor 65 beta  3′  192
GCCAATGACAGCGGCGG  315  0  17  17  0.009019  1  EGLN1  egl nine homolog 1  3′  3449
ATGCGCCCCGCAGCCCC  316  10  138  4  1.24 × 10⁻⁸  1  MGC13186  hypothetical protein MGC13186  5′  321138
ATGCGCCCCGCAGCCCC  317  10  138  4  1.24 × 10⁻⁸  1  SIPA1L2  signal-induced proliferation-associated 1 like  5′  114742
CTGGAACCCCGCACACC  318  0  16  16  0.010329  1  FLJ12606  hypothetical protein FLJ12606  5′  82
GTCCCCGCGCCGCGGCC  319  28  13  −7  3.05 × 10⁻⁷  2  UBXD4  UBX domain containing 4  5′  553390
GTCCCCGCGCCGCGGCC  320  28  13  −7  3.05 × 10⁻⁷  2  APOB  apolipoprotein B precursor  5′  2343039
AACTTTTAAAGTTTCCC  321  0  14  14  0.017811  2  UBXD4  UBX domain containing 4  5′  97
AACTTTTAAAGTTTCCC  322  0  14  14  0.017811  2  APOB  apolipoprotein B precursor  5′  2896332
GCCACCCAAGCCCGTCG  323  0  18  18  0.006642  2  RAB10  ras-related GTP-binding protein RAB10  5′  106
GCCACCCAAGCCCGTCG  324  0  18  18  0.006642  2  KIF3C  kinesin family member 3C  5′  51464
CCTTTGCTTCCCTTTCC  325  0  15  15  0.013161  2  CRIM1  cysteine-rich motor neuron 1  5′  100
CCTTTGCTTCCCTTTCC  326  0  15  15  0.013161  2  MYADML  myeloid-associated differentiation marker-like  5′  2630025
CACACAAGGCGCCCGCG  327  4  37  3  0.022534  2  SIX2  sine oculis homeobox homolog 2  5′  160394
TAAGAGTCCAGCAGGCA  328  4  0  −13  0.029464  2  RTN4  reticulon 4 isoform C  5′  295
TCATTGCATACTGAAGG  329  2  23  4  0.026376  2  SLC1A4  solute carrier family 1, member 4  5′  335302
TCATTGCATACTGAAGG  330  2  23  4  0.026376  2  SERTAD2  SERTA domain containing 2  5′  245
GCGCTACACGCCGCTCC  331  3  35  4  0.01477  2  SLC1A4  solute carrier family 1, member 4  5′  111
GCGCTACACGCCGCTCC  332  3  35  4  0.01477  2  SERTAD2  SERTA domain containing 2  5′  335436
GACGACAGCGCCGCCGC  333  0  18  18  0.006642  2  UXS1  UDP-glucuronate decarboxylase 1  5′  66
AAATTCCATAGACAACC  334  13  7  −6  0.047343  2  HOXD4  homeobox D4  3′  1141
GGCGTGGGGAGAGGGGG  335  4  35  3  0.032525  2  ZNF533  zinc finger protein 533  5′  114958
GCTGCAGGCACTGGGTT  336  4  0  −13  0.029464  2  ATIC  5-aminoimidazole-4-carboxamide ribonucleotide  5′  203
GCTGCAGGCACTGGGTT  337  4  0  −13  0.029464  2  ABCA12  ATP-binding cassette, sub-family A, member 12  5′  173481
ATGGTGTCGCTGGACAG  338  3  37  4  0.010034  2  ARPC2  actin related protein 2/3 complex subunit 2  5′  94
ATGGTGTCGCTGGACAG  339  3  37  4  0.010034  2  IL8RA  interleukin 8 receptor alpha  5′  50063
GACTTCTGGCAAGGGAG  340  0  17  17  0.028576  2  DOCK10  dedicator of cytokinesis 10  5′  208215
ACTGCATCCGGCCTCGG  341  16  89  2  0.006496  2  PTMA  prothymosin, alpha (gene sequence 28)  5′  93674
CCTAGCATCTCCTCTTG  342  6  0  −19  0.016381  3  GRM7  glutamate receptor, metabotropic 7 isoform b  5′  70
GAGGACTGGGGGCTGGG  343  0  14  14  0.017811  3  HRH1  histamine receptor H1  5′  98409
CTTTGGCCGAGGCCGAG  344  5  0  −16  0.010561  3  FGD5  FYVE, RhoGEF and PH domain containing 5  5′  8578
CGGCGCGTCCCTGCCGG  345  33  146  1  0.005894  3  DKFZp313N0621  hypothetical protein DKFZp313N0621  5′  339665
GAGAAGCCGCCAGCCGG  346  7  49  2  0.0217  3  PXK  PX domain containing serine/threonine kinase  3′  346
CCTGCCTCTGGCAGGGG  347  17  82  1  0.029136  3  PLXNA1  plexin A1  5′  5386
GTTTCTTCTCAATAGCC  348  0  22  22  0.011411  3  FLJ12057  hypothetical protein FLJ12057  5′  28432
TCCTTGATGAAATGCGC  349  0  14  14  0.017811  3  SSB4  SPRY domain-containing SOCS box protein SSB-4  5′  434
GCTGGCGATCTGGGGCT  350  0  12  12  0.026152  3  MGC40579  hypothetical protein MGC40579  3′  405
ACCCTTGGAGGAAGGGG  351  0  12  12  0.026152  3  C3orf21  chromosome 3 open reading frame 21  3′  134
GGGCGGTGGCGGGGACG  352  0  14  14  0.017811  4  RGS12  regulator of G-protein signalling 12 isoform 2  5′  21007
CCTGCGCCGGGGGAGGC  353  66  240  1  0.011585  4  ADRA2C  alpha-2C-adrenergic receptor  3′  432
ATTTAGGGGTCTGTACC  354  0  15  15  0.013161  4  KIAA0232  KIAA0232 gene product  5′  58
GTCCGTGGAATAGAAGG  355  8  69  3  0.001269  4  Not Found
GTGGCGCGCTGGCGGGG  356  0  13  13  0.0158  4  RASL1B  RAS-like family 11 member B  5′  202915
GTGGCGCGCTGGCGGGG  357  0  13  13  0.0158  4  USP46  ubiquitin specific protease 46  5′  139
CTGCCCAGTACCTGAGG  358  0  18  18  0.006642  4  SLC4A4  solute carrier family 4, sodium bicarbonate  5′  151833
CCGCGGATCTCGCCGGT  359  2  25  4  0.01729  4  ASAHL  N-acylsphingosine amidohydrolase-like protein  3′  67
AGCCACCTGCGCCTGGC  360  14  81  2  0.007548  4  PAQR3  progestin and adipoQ receptor family member III  5′  101
TGCGGAGAAGACCCGGG  361  2  24  4  0.019587  4  ELOVL6  ELOVL family member 6, elongation of long chain  3′  1583
GCTGTCCGCACGCGGCC  362  0  15  15  0.013161  4  SMAD1  Sma- and Mad-related protein 1  5′  301087
GCTGTCCGCACGCGGCC  363  0  15  15  0.013161  4  HSHIN1  HIV-1 induced protein HIN-1 isoform 1  5′  5967
TGCACGCACACTCTTCC  364  2  29  4  0.019901  4  LOC152485  hypothetical protein LOC152485  3′  851
GCGTTTGGGGGTGTCGG  365  0  21  21  0.003436  4  LOC152485  hypothetical protein LOC152485  3′  851
GTGGGGAGGCTGGGGCG  366  0  43  43  0.00042  4  DCAMKL2  doublecortin and CaM kinase-like 2  5′  1633428
GTGGGGAGGCTGGGGCG  367  0  43  43  0.00042  4  NR3C2  nuclear receptor subfamily 3, group C, member 2  5′  3189
CTGCACTAAAATATTCG  368  3  29  3  0.046121  4  MGC45800  hypothetical protein LOC90768  5′  304606
CTTAGATCTAGCGTTCC  369  6  58  3  0.002127  4  DKFZP564J102  DKFZP564J102 protein  5′  4
CCATATTTGCCCAAGCC  370  0  12  12  0.026152  5  EMB  embigin homolog  3′  410
TGACAGGCGTGCGAGCC  371  2  43  7  0.001198  5  MGC33648  hypothetical protein MGC33648  5′  92617
TGACAGGCGTGCGAGCC  372  2  43  7  0.001198  5  FLJ11795  hypothetical protein FLJ11795  5′  699674
CTAGAAAGACAGATTGG  373  0  12  12  0.026152  5  TIGA1  TIGA1  5′  402673
CTAGAAAGACAGATTGG  374  0  12  12  0.026152  5  C5orf13  neuronal protein 3.1  5′  594
CTGGGTTGCGATTAGCT  375  23  25  −3  0.018417  5  PPIC  peptidylprolyl isomerase C  5′  62181
CGTGGCTCGGATTCGGG  376  0  13  13  0.0158  5  ARHGAP26  GTPase regulator associated with the focal  3′  8
CCAGAGGGTCTTAAGTG  377  11  71  2  0.00663  5  NR3C1  nuclear receptor subfamily 3, group C, member 1  3′  553
CTGCGGGAGCTGCGGCC  378  0  17  17  0.028576  5  SGCD  delta-sarcoglycan isoform 1  5′  597771
TCCGACAAGAAGCCGCC  379  0  26  26  0.004502  5  MSX2  msh homeo box homolog 2  3′  605
CGTCTCCCATCCCGGGC  380  18  17  −3  0.016276  5  CPLX2  complexin 2  3′  1498
GCAGAAAAAGCACAAAG  381  11  4  −9  0.026609  5  FLT4  fms-related tyrosine kinase 4 isoform 1  5′  24508
GTCAGCGCCGGCCCCAG  382  5  44  3  0.013197  6  EGFL9  EGF-like-domain, multiple 9  3′  134
ATGAGTCCATTTCCTCG  383  31  40  −3  0.029841  7  MGC10911  hypothetical protein MGC10911  5′  96664
GCGAGGGCCCAGGGGTC  384  12  75  2  0.006269  7  SLC29A4  solute carrier family 29 (nucleoside  3′  67
GGGGGGGAACCGGACCG  385  0  18  18  0.006642  7  ACTB  beta actin  3′  865
AACTTGGGGCTGACCGG  386  0  30  30  0.006104  7  AUTS2  autism susceptibility candidate 2  3′  1095850
CCTTGACTGCCTCCATC  387  5  0  −16  0.046199  7  WBSCR17  Williams Beuren syndrome chromosome region 17  5′  512
CCCAGGCTTGGAATCCC  388  2  23  4  0.026376  7  AP1S1  adaptor-related protein complex 1, sigma 1  5′  107
TACTTTTAACTGCCTGC  389  0  23  23  0.00317  7  FOXP2  forkhead box P2 isoform II  5′  328728
TACTTTTAACTCCCTGC  390  0  23  23  0.00317  7  PPP1R3A  protein phosphatase 1 glycogen-binding  5′  167483
ATTGCATTCTTGAGGGC  391  0  12  12  0.026152  7  SLC4A2  solute carrier family 4, anion exchanger, member  3′  10
GAGCTGGCAAGCCTGGG  392  0  14  14  0.017811  7  ASB10  ankyrin repeat and SOCS box-containing protein  3′  11480
GATGCCACCAGGTTGTG  393  13  7  −6  0.047343  7  HTR5A  5-hydroxytryptamine (serotonin) receptor 5A  5′  579
GATGCCACCAGGTTGTG  394  13  7  −6  0.047343  7  PAXIP1L  PAX transcription activation domain interacting  5′  67372
TCCCGCCGCGCGTTGCC  395  0  16  16  0.010329  8  PCM1  pericentriolar material 1  3′  243
CCCTGTCCTAGTAACGC  396  2  36  6  0.004927  8  DDHD2  DDHD domain containing 2  3′  541
CGAGGAAGTGACCCTCG  397  0  14  14  0.017811  8  CHD7  chromodomain helicase DNA binding protein 7  5′  156
GCGGGGGCAGCAGACGC  398  9  0  −29  0.002372  8  PRDM14  PR domain containing 14  3′  768
TAACTGTCCTTTCCGTA  399  23  5  −15  6.66 × 10⁻⁹  8  Not Found
TCTGTATTTTCCCGGGG  400  0  22  22  0.011411  8  FAM49B  family with sequence similarity 49, member B  5′  528
AAGAGGCAGAACGTGCG  401  34  12  −9  2.68 × 10⁻¹⁰  8  KCNK9  potassium channel, subfamily K, member 9  3′  360
GCCTCAGCCCGCACCCG  402  0  21  21  0.015063  8  DGAT1  diacylglycerol O-acyltransferase 1  5′  84
GACCGGGGCGCAGGGCC  403  0  21  21  0.015063  8  ZNF517  zinc finger protein 517  5′  130
GACCGGGGCGCAGGGCC  404  0  21  21  0.015063  8  RPL8  ribosomal protein L8  5′  6362
GTGCGGGCGACGGCAGC  405  12  72  2  0.010135  9  KLF9  Kruppel-like factor 9  3′  995
GCCCGCCTGAGCAAGGG  406  44  23  −6  5.46 × 10⁻¹⁰  9  C9orf125  chromosome 9 open reading frame 125  3′  738
GGTGGAGGCAGGCGGGG  407  0  15  15  0.013161  9  TXN  thioredoxin  3′  266
GGCGTTAATAGAGAGGC  408  4  0  −13  0.029464  9  PRDM12  PR domain containing 12  5′  5017
AGGTTGTTGTTCTTGCA  409  20  14  −5  0.000803  9  PRDM12  PR domain containing 12  3′  1427
AGCCGCGGGCAGCCGCC  410  0  21  21  0.015063  9  BARHL1  BarH-like 1  5′  87
AGCCACCGTACAAGGCC  411  8  49  2  0.039937  10  PFKP  phosphofructokinase, platelet  3′  1056
GCGGGCAGCTCGAGGCG  412  0  19  19  0.019333  10  BAMBI  BMP and activin membrane-bound inhibitor  3′  203
GCGGCCGCGGGCAGGGG  413  0  20  20  0.01441  10  TRIM8  tripartite motif-containing 8  5′  375
CCCCGTGGCGGGAGCGG  414  22  119  2  0.001632  10  NEURL  neuralized-like  5′  630
CCCCGTGGCGGGAGCGG  415  22  119  2  0.001632  10  FAM26A  family with sequence similarity 26, member A  5′  14420
GCCTGGCTCTCCTTCGC  416  0  15  15  0.013161  10  KIAA1598  KIAA1598  3′  509
AAAAGTAAACAGGTATT  417  4  0  −13  0.029464  10  PLEKHA1  pleckstrin homology domain containing, family A  5′  162
CCGCGCTGAGGGGGGGC  418  0  17  17  0.028576  10  CTBP2  C-terminal binding protein 2 isoform 1  3′  1219
TCAGAGGCTGATGGGGC  419  6  52  3  0.006425  10  MGMT  O-6-methylguanine-DNA methyltransferase  5′  1340765
TCAGAGGCTGATGGGGC  420  6  52  3  0.006425  10  MKI67  antigen identified by monoclonal antibody Ki-67  5′  232
CGGAGCCGCCCCAGGGG  421  0  28  28  0.009196  11  RNH  ribonuclease/angiogenin inhibitor  3′  381
ATGCCACCCCAGGTTGC  422  0  21  21  0.015063  11  OSBPL5  oxysterol-binding protein-like protein 5 isoform  3′  397
GCGCTGCCCTATATTGG  423  11  75  2  0.00341  11  FLJ11336  hypothetical protein FLJ11336  3′  375
TCGTCCTGGGTGGAGGG  424  2  22  3  0.027586  11  C11ORF4  chromosome 11 hypothetical protein ORF4  5′  458
TCGTCCTGGGTGGAGGG  425  2  22  3  0.027586  11  BAD  BCL2-antagonist of cell death protein  5′  708
GCCTCTGCAGCCAGGTG  426  6  0  −19  0.005543  11  DRAP1  DR1-associated protein 1  3′  368
CCACAGACCAGTGGGTG  427  6  42  2  0.037507  11  TPCN2  two pore segment channel 2  3′  305
CCCCGGCAGGCGGCGGC  428  17  89  2  0.010843  11  ROBO3  roundabout, axon guidance receptor, homolog 3  5′  64774
CCCCGGCAGGCGGCGGC  429  17  89  2  0.010843  11  FLJ23342  hypothetical protein FLJ23342  5′  208
GAACAAACCCAGGGATC  430  18  11  −5  0.000558  12  KCNA1  potassium voltage-gated channel, shaker-related  5′  1403
TCGGAGTCCCCGTCTCC  431  5  56  3  0.001392  12  ANKRD33  ankyrin repeat domain 33  5′  73619
AGAACGGGAACCGTCCA  432  29  15  −6  6.88 × 10⁻⁷  12  CENTG1  centaurin, gamma 1  3′  3647
GCCTGGACGGCCTCGGG  433  2  23  4  0.026376  12  CSRP2  cysteine and glycine-rich protein 2  3′  185
GTGCGGCGCGGCTCAGC  434  0  18  18  0.022346  12  DIP13B  DIP13 beta  3′  6
TTGCAAAGAACGGAGCC  435  0  12  12  0.026152  12  CUTL2  cut-like 2  3′  265
TTTCAGCGGGAGCCGCC  436  24  19  −4  0.000698  12  KIAA1853  KIAA1853 protein  5′  64
CGAACTTCCCGGTTCCG  437  43  19  −7  4.00 × 10⁻¹¹  12  Not Found
CAGCGGCCAAAGCTGCC  438  32  129  1  0.03085  12  RAN  ras-related nuclear protein  5′  257
CAGCGGCCAAAGCTGCC  439  32  129  1  0.03085  12  EPIM  epimorphin isoform 2  5′  32499
GTAGGTGGCGGCGAGCG  440  0  22  22  0.011411  13  USP12  ubiquitin-specific protease 12-like 1  3′  653
CTGTACATCGGGGCGGC  441  6  0  −19  0.016381  13  SOX1  SRY (sex determining region Y)-box 1  5′  425
GCTGCTGCCCCCAGCCC  442  0  19  19  0.005254  14  KIAA0323  KIAA0323  3′  158
CGCAGTTCGGAAGGACC  443  0  12  12  0.026152  14  MTHFD1  methylenetetrahydrofolate dehydrogenase 1  5′  559
CGCAGTTCGGAAGGACC  444  0  12  12  0.026152  14  ESR2  estrogen receptor 2  5′  93455
CTGAGGCTGCGCCCGCC  445  0  12  12  0.026152  14  GPR68  G protein-coupled receptor 68  5′  164030
GGGCGGTGCCGCCAGTC  446  3  49  5  0.000941  14  EML1  echinoderm microtubule associated protein like 1  5′  62907
GCCCCACGCCCCCTGGC  447  9  65  2  0.00516  14  C14orf153  chromosome 14 open reading frame 153  5′  681
GCCCCACGCCCCCTGGC  448  9  65  2  0.00516  14  BAG5  BCL2-associated athanogene 5  5′  19
CTCGTGCGAGTCGCGCG  449  0  17  17  0.028576  15  NDNL2  necdin-like 2  5′  405209
GCCCCGGCCGCCGCGCC  450  4  38  3  0.018724  15  Not Found
AGAGCTGAGTCTCACCC  451  5  45  3  0.01099  15  CDAN1  codanin 1  3′  359
GAGCCTCTTATGGCTCG  452  0  12  12  0.026152  15  RORA  RAR-related orphan receptor A isoform c  3′  205
TCAGGCTTCCCCTTCGG  453  15  81  2  0.012835  15  PIAS1  protein inhibitor of activated STAT, 1  5′  190450
GCCGGGCCCCGCCCTGC  454  0  21  21  0.015063  15  C15orf17  chromosome 15 open reading frame 17  5′  295
CCTTGAGAGCAGAGAGC  455  6  41  2  0.044419  15  LRRN6A  leucine-rich repeat neuronal 6A  3′  43
CTAAGTGGGCAGCACTG  456  0  19  19  0.005254  15  ARNT2  aryl-hydrocarbon receptor nuclear translocator  3′  128
GGCCGGGCTGGCACCGG  457  0  19  19  0.005254  16  TMEM8  transmembrane protein 8 (five membrane-spanning  3′  496
GGTGCAGCTCTGAGGCG  458  0  44  44  0.000342  16  RHOT2  ras homolog gene family, member T2  5′  119
GAGTGCCCGGCTCGCCC  459  0  18  18  0.022346  16  C1QTNF8  C1q and tumor necrosis factor related protein 8  3′  5691
CCCGCGGGAGAGACCGG  460  5  48  3  0.006311  16  E4F1  p120E4F  5′  8954
CCCGCGGGAGAGACCGG  461  5  48  3  0.006311  16  MGC21830  hypothetical protein MGC21830  5′  3623
CGCAGTGTCCTAGTGCC  462  0  24  24  0.002455  16  CGI-14  CGI-14 protein  5′  89
GAGCTCAGAGCTCCTCC  463  0  20  20  0.00615  16  CGI-14  CGI-14 protein  5′  89
CCTTCCTGCGAACCCCT  464  0  13  13  0.0158  16  MMP25  matrix metalloproteinase 25  3′  11905
CGGGCCGGGTCGGCCTC  465  0  41  41  0.000635  16  NUDT16L1  nudix-type motif 16-like 1  5′  110
GTGGCGCTCGGGGTGCG  466  0  13  13  0.0158  16  PPL  periplakin  5′  283
CCGGGTCCGCGGGCGAG  467  14  123  3  5.66 × 10⁻⁶  16  USP7  ubiquitin specific protease 7 (herpes  3′  725
ATCCGGCCAAGCCCTAG  468  8  62  2  0.004442  16  ATF7IP2  activating transcription factor 7 interacting  5′  244550
ATCCGGCCAAGCCCTAG  469  8  62  2  0.004442  16  GRIN2A  N-methyl-D-aspartate receptor subunit 2A  5′  809
GTTAAAAACTTCCAGCC  470  0  12  12  0.026152  16  DNAH3  dynein, axonemal, heavy polypeptide 3  3′  895
GGGTAGGCACAGCCGTC  471  4  61  5  0.000219  16  TBX6  T-box 6 isoform 1  5′  85
TGCGCGCGTCGGTGGCG  472  4  45  3  0.004991  16  LOC51333  mesenchymal stem cell protein DSC43  3′  9832
CGGTGCCCGGGAGGCCC  473  4  0  −13  0.029464  16  CHD9  chromodomain helicase DNA binding protein 9  5′  2004600
CGGTGCCCGGGAGGCCC  474  4  0  −13  0.029464  16  SALL1  sal-like 1  5′  654
GTGCAGTCTCGGCCCGG  475  2  43  7  0.001198  16  FBXL8  F-box and leucine-rich repeat protein 8  3′  3905
TCCCGCGCCCAGGCCCC  476  9  0  −29  0.002372  16  ZCCHC14  zinc finger, CCHC domain containing 14  3′  143
GCAGCCCCTTGGTGGAG  477  21  8  −8  2.32 × 10⁻⁶  16  TUBB3  tubulin, beta, 4  3′  843
CCGTGTTGTCCTGGCCG  478  3  40  4  0.00559  17  MNT  MAX binding protein  3′  228
CCACACCTCTCTCCAGG  479  0  18  18  0.006642  17  SENP3  SUMO1/sentrin/SMT3 specific protease 3  5′  326
GGCAACCACTCAGGACG  480  2  51  8  0.000235  17  HCMOGT-1  sperm antigen HCMOGT-1  3′  69709
CACAGCCAGCCTCCCAG  213  23  9  −8  8.64 × 10⁻⁷  17  LHX1  LIM homeobox protein 1  3′  3701
CCAAGGAACCTGAAAAC  482  0  14  14  0.017811  17  ACLY  ATP citrate lyase isoform 1  3′  446
GCCCAAAAGGAGAATGA  483  6  0  −19  0.016381  17  PHOSPHO1  phosphatase, orphan 1  3′  5786
CACGCCACCACCCACCC  484  0  16  16  0.010329  17  NXPH3  neurexophilin 3  5′  318
GAAACCCCTCTGAGCCC  485  0  17  17  0.028576  17  ABC1  amplified in breast cancer 1  3′  235
GTGACCAGCCTGGAGAG  486  15  14  −3  0.030075  17  SDK2  sidekick 2  5′  206723
CTGAATGGGGCAAGGAG  487  48  40  −4  1.40 × 10⁻⁶  17  ENPP7  ectonucleotide pyrophosphatase/phosphodiesterase  5′  628261
CCCCAGGCCGGGTGTCC  303  9  58  2  0.016753  17  CBX8  chromobox homolog 8  5′  16730
CCCCGACCCCAGGCGGG  489  0  19  19  0.005254  18  RNF152  ring finger protein 152  5′  1155
TAAACTCTTTTCCTGTT  490  0  12  12  0.026152  19  PIAS4  protein inhibitor of activated STAT, 4  5′  17748
TAAACTCTTTTCCTGTT  491  0  12  12  0.026152  19  EEF2  eukaryotic translation elongation factor 2  5′  4554
ACCCTCGCGTGGGCCCC  492  16  98  2  0.001595  19  ZNF136  zinc finger protein 136 (clone pHZ-20)  5′  89
ACCCTCGCGTGGGCCCC  493  16  98  2  0.001595  19  ZNF625  zinc finger protein 625  5′  6300
TCCGGGGCCCCGCCCCC  494  0  13  13  0.0158  19  KLF1  Kruppel-like factor 1 (erythroid)  3′  1241
CGCCCCGGTGCCCAACG  495  16  75  1  0.048103  19  PKN1  protein kinase N1 isoform 2  5′  13821
CGCCCCGGTGCCCAACG  496  16  75  1  0.048103  19  DDX39  DEAD (Asp-Glu-Ala-Asp) box polypeptide 39  5′  173
AGCCTGCAAAGGGGAGG  497  18  83  1  0.039473  19  AKAP8L  A kinase (PRKA) anchor protein 8-like  5′  13794
TCCCTGTCCCTGCAATC  498  5  0  −16  0.046199  19  SPTBN4  spectrin, beta, non-erythrocytic 4  3′  52746
CCCGCTCCTTCGGTTCG  499  14  73  2  0.025146  19  ITPKC  inositol 1,4,5-trisphosphate 3-kinase C  5′  273
CCCGCTCCTTCGGTTCG  500  14  73  2  0.025146  19  ADCK4  aarF domain containing kinase 4  5′  134
TTGGGTTCGCTCAGCGG  501  6  52  3  0.006425  19  ASE-1  CD3-epsilon-associated protein; antisense to  5′  1320
TTGGGTTCGCTCAGCGG  502  6  52  3  0.006425  19  PPP1R13L  protein phosphatase 1, regulatory (inhibitor)  5′  11721
GCTGCGGCCGGCCGGGG  503  0  20  20  0.01441  19  UBE2S  ubiquitin carrier protein  5′  478
GACAGACCCGGTCCCTG  504  0  12  12  0.026152  20  RRBP1  ribosome binding protein 1  3′  270
CGCTCCCACGTCCGGGA  505  3  35  4  0.01477  20  SNTA1  acidic alpha 1 syntrophin  3′  288
CTTTCAAACTGGACCCG  506  3  30  3  0.038252  20  Not Found
GGGGATTCTACCCTGGG  507  20  100  2  0.009572  20  ARFGEF2  ADP-ribosylation factor guanine  5′  93944
GGGGATTCTACCCTGGG  508  20  100  2  0.009572  20  PREX1  PREX1 protein  5′  62
TGTCACAGACTCCCAGC  509  5  39  2  0.032404  21  USP25  ubiquitin specific protease 25  5′  664846
TGTCACAGACTCCCAGC  510  5  39  2  0.032404  21  NRIP1  receptor interacting protein 140  5′  96802
TGGGCTGCTGTCGGGGG  511  0  14  14  0.017811  21  CLIC6  chloride intracellular channel 6  3′  868
CGCGCGCAGCGGGCGCC  512  0  13  13  0.0158  22  EIF3S7  eukaryotic translation initiation factor 3  5′  51
GCCCTGGGGTGTTATGG  513  0  22  22  0.011411  22  FLJ27365  FLJ27365 protein  5′  13829
GCCCTGGGGTGTTATGG  514  0  22  22  0.011411  22  FLJ10945  hypothetical protein FLJ10945  5′  18029
CCCCTTCTCAGCTCCGG  515  0  12  12  0.026152  22  TUBGCP6  tubulin, gamma complex associated protein 6  5′  73
ATTTACACGGGGCTCAC  516  0  13  13  0.0158  23  STAG2  stromal antigen 2  5′  1402

The column headings are as in Table 2 except that the MSDK libraries compared are the N-EPI-I7 and I-EPI-7 libraries (see Table 3 for details
of the tissues from which these libraries were made).

Although statistically significant differences were observed, the comparison of normal and tumor fibroblast-enriched stroma showed a more similar pattern (Tables 6-8).

TABLE 6
Chromosomal location and analysis of the frequency of MSDK tags in the N-STR-I7 and I-STR-7 MSDK libraries.

                                                                     Tag Variety Ratio Tag Copy Ratio    Differential Tag (P < 0.05)
            Virtual  Observed  N-STR-I7  N-STR-I7  I-STR-7  I-STR-7  I-STR-7/N-STR-I7  I-STR-7/N-STR-I7  I-STR-7 >  N-STR-I7 >
Chr         Tags     Tags      Variety   Copies    Variety  Copies                                       N-STR-I7   I-STR-7
1           551      197       55        315       190      1877     3.455             5.959             43         0
2           473      140       47        325       134      1576     2.851             4.849             31         0
3           349      124       38        309       120      1437     3.158             4.650             24         0
4           281      89        28        126       85       788      3.036             6.254             21         0
5           334      104       45        274       98       1170     2.178             4.270             19         0
6           338      99        31        138       95       825      3.065             5.978             16         0
7           403      134       43        162       131      1094     3.047             6.753             28         1
8           334      111       30        131       107      928      3.567             7.084             24         0
9           349      127       36        277       124      1125     3.444             4.061             27         0
10          387      126       39        202       121      1009     3.103             4.995             23         0
11          379      121       40        204       116      870      2.900             4.265             15         0
12          299      106       33        179       102      856      3.091             4.782             17         1
13          138      43        18        87        39       414      2.167             4.759             5          0
14          228      67        24        129       65       585      2.708             4.535             10         0
15          260      80        22        102       77       552      3.500             5.412             11         0
16          340      113       40        189       104      802      2.600             4.243             15         1
17          400      160       50        385       152      1550     3.040             4.026             27         0
18          181      54        18        101       49       417      2.722             4.129             6          0
19          463      148       44        193       141      1053     3.205             5.456             24         1
20          236      71        18        132       69       771      3.833             5.841             19         0
21          71       21        9         35        20       187      2.222             5.343             4          0
22          217      68        20        165       67       630      3.350             3.818             7          0
X           185      51        19        75        47       408      2.474             5.440             12         1
Y           9
Matches     7205     2354      747       4235      2253     20924    3.016             4.941             428        5
No Matches           3343      2771      14479     796      7166     0.287             0.495             62         397
Total       7205     5697      3518      18714     3049     28090    0.867             1.501             490        402

The column headings are as indicated for Table 1.
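The two ratio columns of Table 6 appear to be simple quotients of the I-STR-7 value over the N-STR-I7 value, rounded to three decimal places; the tabulated figures are consistent with that reading. The following Python sketch reproduces two rows under that assumption. The names `LibraryCounts` and `tag_ratios` are hypothetical and introduced only for illustration; they are not part of the MSDK method as described.

```python
from dataclasses import dataclass


@dataclass
class LibraryCounts:
    """Tag counts for one MSDK library (hypothetical helper type)."""
    variety: int  # number of distinct tags observed
    copies: int   # total number of tag copies observed


def tag_ratios(normal: LibraryCounts, tumor: LibraryCounts) -> tuple[float, float]:
    """Return (tag variety ratio, tag copy ratio) as tumor/normal.

    Assumption: the ratios in Table 6 are plain quotients rounded to
    three decimals; no library-size normalization is applied here.
    """
    variety_ratio = round(tumor.variety / normal.variety, 3)
    copy_ratio = round(tumor.copies / normal.copies, 3)
    return variety_ratio, copy_ratio


# Chromosome 1 row of Table 6: N-STR-I7 (55, 315) vs. I-STR-7 (190, 1877)
chr1 = tag_ratios(LibraryCounts(55, 315), LibraryCounts(190, 1877))

# "Total" row of Table 6: N-STR-I7 (3518, 18714) vs. I-STR-7 (3049, 28090)
total = tag_ratios(LibraryCounts(3518, 18714), LibraryCounts(3049, 28090))

print(chr1, total)  # (3.455, 5.959) (0.867, 1.501)
```

Because the "Total" copy ratio (1.501) reflects the different sizes of the two libraries, per-chromosome copy ratios well above that value indicate more than a difference in library size.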

TABLE 7 MSDK tags significantly (p <0.050) differentially present inN-STR-I7 and I-STR-7 MSDK libraries and genes associated with the MSDKtags. Ra- tio Position Distance I- of AscI of AscI STR- site in site SEQN- I- 7/N- relation from tr. ID STR- STR- STR- to tr. Start MSDK Tag NO.I7 7 I7 P value Chr Gene Description Start (bp) AGTCCCCAGGGCTGGCA  517 9  30   2 0.03582  1 HES5 hairy and enhancer of 5′ 16528 split 5ATTAACCTTTGAAGCCC  518  0  17  17 0.00238  1 SHREW1 transmembraneprotein 3′ 687 SHREW1 GGGCTGCCTCGCCGGGC  519 11  34   2 0.03524  1 ESPNespin 5′ 5344 GGGCTGCCTCGCCGGGC  520 11  34   2 0.03524  1 RP1-120G22.10brain acyl-CoA hydrolase 5′ 25682 isoform hBACHa/X GAAATGCTAAGGGGTTG 521  4  37   6 7.3 ×  1 PIK3CD phosphoinositide-3-ki- 5′ 39 10⁻⁵ nase,catalytic, delta TAAATTCCACTGAAAAT  522  0   7   7 0.01683  1 PAX7paired box gene 7 3′ 9827 isoform 1 GTGCCGCCGCGGGCGCC  523  4  31   50.00032  1 KIAA0478 KIAA0478 gene product 5′ 308006 GTGCCGCCGCGGGCGCC 524  4  31   5 0.00032  1 WNT4 wingless-type MMTV in- 5′ 733 tegrationsite family, AAAATGTTCTCAAACCC  525  0  11  11 0.00359  1 ARID1A AT richinteractive do- 5′ 75135 main 1A (SWI- like) AGCACCCGCCTGGAACC  526  6 21   2 0.03859  1 PTPRF protein tyrosine phos- 3′ 727 phatase, receptortype, F GCTCACCTACCCAGGTG  527  3  44  10 2 ×  1 Not Found 10⁻⁶GCAGGTAGACCAGGCCT  528  2  15   5 0.01234  1 GLIS1 GLIS family zincfinger 5′ 4943 1 CAGCTTTTGAAATCAGG  529  8  34   3 0.00589  1 KIAA1579hypothetical protein 5′ 196 FLJ10770 GCCTCTCTGCGCCTGCC  530  8  28   20.03562  1 GFI1 growth factor 3′ 4842 independent 1 CGCAGAATCCCGGAGGC 531  0   8   8 0.01239  1 EVI5 ecotropic viral integra- 3′ 7704 tionsite 5 CCCGGACTTGGCCAGGC  532 34 120   2 1 ×  1 NHLH2 nescient helixloop 3′ 2971 10⁻⁶ helix 2 AGCGCTTGGCGCTCCCA  533  3  18   4 0.00867  1NPR1 natriuretic peptide re- 3′ 677 ceptor A/guanylate cyclaseGCCCAACCCCGGGGAGT  534  3  21   5 0.0037  1 P66beta transcriptionrepressor 5′ 117605 p66 beta component of TCTGGGGCCGGGTAGCC  535 15  54  
2 0.00125  1 P66beta transcription repressor 5′ 117605 p66 betacomponent of CGTGTGTATCTGGGGGT  536  3  17   4 0.01446  1 MUC1 mucin 1,transmembrane 3′ 188528 GCAGCGGCGCTCCGGGC  537  4  54   9 0  1 MUCImucin 1, transmembrane 3′ 139119 GATCCTCGCCCGCGCCT  538  0  20  200.00085  1 EFNA4 ephrin A4 isoform a 3′ 365 CCGGTTTCCCAGCGCCC  539  0  9   9 0.00623  1 MUC1 mucin 1, transmembrane 3′ 111426CTGCTCGGGGGACCCCC  540  0   9   9 0.00623  1 MTX1 metaxin 1 isoform 1 3′304 GGCGCCGCCATCTTGCC  541  0   9   9 0.00623  1 MTX1 metaxin 1 isoform1 3′ 304 CCAGGGCCTGGCACTGC  542 13 101   5 0  1 IGSF9 immunoglobulinsuper- 5′ 393 family, member 9 TTCGGGCCGGGCCGGGA  543 21  68   2 0.00073 1 LMX1A LIM homeobox transcrip- 5′ 752 tion factor 1, alphaAGCCCTCGGGTGATGAG   29 13  56   3 0.00019  1 LMX1A LIM homeoboxtranscrip- 5′ 752 tion factor 1, alpha GAGGGGGGCAAAACTAC  545  0  12  120.00296  1 SCYL3 SCY1-like 3 isoform 1 3′ 561 CTTATGTTTACAGCATC  546  2 15   5 0.01234  1 PAPPA2 pappalysin 2 isoform 2 5′ 255915CTTATGTTTACAGCATC  547  2  15   5 0.01234  1 RFWD2 ring finger and WDre- 5′ 21 peat domain 2 isoform a TATTTGGTGCTGCCACA  548  0   7   70.01683  1 LHX4 LIM homeobox protein 4 3′ 5084 TCTCCTTGCTCGCTCCG  549  0 13  13 0.00244  1 XPR1 xenotropic and polytro- 5′ 128896 pic retrovirusreceptor TCTCCTTGCTCGCTCCG  550  0  13  13 0.00244  1 ACBD6acyl-Coenzyme A binding 5′ 797 domain containing 6 GTTCTCAAACAGCTTTC 551  0  16  16 0.0031  1 IPO9 importin 9 3′ 343 TCCAGGCAGGGCCTCTG  55211  54   3 8.4 ×  1 BTG2 B-cell translocation 3′ 431 10⁻⁵ gene 2TCAGATAGTTCTCCAGC  553  0   8   8 0.01239  1 NFASC neurofascin isoform 45′ 19 TCAGATAGTTCTCCAGC  554  0   8   8 0.01239  1 LRRN5 leucine richrepeat 5′ 143165 neuronal 5 precursor ACGTTTTTAACTACACA  555  0  20  200.00024  1 ELK4 ELK4 protein isoform a 3′ 621 CTGTCCAACTCCCAGGG  556  0 16  16 0.00081  1 MAPKAPK2 mitogen-activated pro- 3′ 1117 teinkinase-activated TGGATTTGGTCGTCTCC  557  0   8   8 0.01239  1 PLXNA2plexin A2 3′ 428 GCCCCCGTGGCGCCCCG  558 16  
57   2 0.00095  1 CENPFcentromere protein F 5′ 51300 (350/400 kD) GCCCCCGTGGCGCCCCG  559 16  57  2 0.00095  1 PTPN14 protein tyrosine phos- 5′ 589 phatase,non-receptor type CCACACCAGGATTCGAG  560  0   7   7 0.01683  1 HSPC163HSPC163 protein 3′ 375 GTGAACTTCCAAGATGC  561  7  26   2 0.01495  1CNIH3 comichon homolog 3 3′ 50 GCTAGGGAAAAACAGGC  562  2  32  11 5.5 × 1 MGC42493 hypothetical protein 5′ 244931 10⁻⁵ MGC42493GCTAGGGAAAAACAGGC  563  2  32  11 5.5 ×  1 CDC42BPA CDC42-bindingprotein 5′ 486 10⁻⁵ kinase alpha isoform A GACGCGCTCCCGCGGGC  564  0  16 16 0.00081  1 WNT3A wingless-type MMTV inte- 5′ 59111 gration sitefamily GACGCGCTCCCGCGGGC  565  0  16  16 0.00081  1 WNT9A wingless-typeMMTV inte- 5′ 41 gration site family GAGCGGCCGCCCAGAGC  566  7  39   40.00054  1 TAF5L PCAF associated factor 3′ 192 65 beta ATGCGCCCCGCAGCCCC 567 16  76   3 3 ×  1 MGC13186 hypothetical protein 5′ 321138 10⁻⁶MGC13186 ATGCGCCCCGCAGCCCC  568 16  76   3 3 ×  1 SIPA1L2 signal-inducedprolif- 5′ 114742 10⁻⁶ eration-associated 1 like CTCTCACCCGAGGAGCG  569 0  10  10 0.00467  2 OACT2 O-acyltransferase (mem- 3′ 47 brane bound)domain GTTCCTGCTCTCCACGA  570  3  19   4 0.00645  2 KLF11 Kruppel-likefactor 11 3′ 387 GTCCCCGCGCCGCGGCC  571 29  67   2 0.03072  2 UBXD4 UBXdomain containing 4 5′ 553390 GTCCCCGCGCCGCGGCC  572 29  67   2 0.03072 2 APOB apolipoprotein B 5′ 2343039 precursor CTTTTGTCCCTTTTGTC  573  0 23  23 0.00028  2 ADCY3 adenylate cyclase 3 5′ 619 GCCACCCAAGCCCGTCG 574  0   9   9 0.00623  2 RAB10 ras-related GTP-binding 5′ 106 proteinRAB10 GCCACCCAAGCCCGTCG  575  0   9   9 0.00623  2 KIF3C kinesin familymember 3C 5′ 51464 ACCTTAGGCCCTTCTCT  576  0  11  11 0.00359  2 FOSL2FOS-like antigen 2 5′ 2425 ATGCGAGGGGCGCGGTA  577 18  80   3 3 ×  2FLJ32954 hypothetical protein 5′ 277913 10⁻⁶ FLJ32954 ATGCGAGGGGCGCGGTA 578 18  80   3 3 ×  2 CDC42EP3 Cdc42 effector protein 3 5′ 366 10⁻⁶GATTCTGTCTATGCTTC  579  2  21   7 0.00133  2 THUMPD2 THUMP domaincontaining 5′ 16 2 GCAGCATTGCGGCTCCG  580 19 157   
6 0  2 SIX2 sineoculis homeobox 5′ 160394 homolog 2 CACACAAGGCGCCCGCG  581  6  29   30.00299  2 SIX2 sine oculis homeobox 5′ 160394 homolog 2TCATTGCATACTGAAGG  582  2  18   6 0.00391  2 SLC1A4 solute canier family1, 5′ 335302 member 4 TCATTGCATACTGAAGG  583  2  18   6 0.00391  2SERTAD2 SERTA domain containing 5′ 245 2 CTGGAGCTCAGCACTGA  584  0  12 12 0.00296  2 Not Found TTCACCCCCACCCACTC  585  0  15  15 0.00413  2Not Found CCCCAGCTCGGCGGCGG  586 63 195   2 0  2 TCF7L1 HMG-boxtranscription 3′ 859 factor TCF-3 AGGGCAATCCAGCCCTC  587  0  13  130.00923  2 LOC51315 hypothetical protein 3′ 197 LOC51315AAGCAGTCTTCGAGGGG  588  7  61   6 0  2 CNNM3 cyclin M3 isoform 1 5′ 396CGGTGGGGTAGGCGGTC  589  0  13  13 0.00923  2 SEMA4C semaphorin 4C 3′ 336AGAGTGACGTGCTGTGG  590  0  12  12 0.00296  2 MERTK c-mer proto-oncogene3′ 281 tyrosine kinase CACCAAACCTAGAAGGC  591  4  24   4 0.00251  2 GLI2GLI-Kruppel family mem- 5′ 56228 ber GLI2 isoform alphaCACCAAACCTAGAAGGC  591  4  24   4 0.00251  2 FLJ14816 hypotheticalprotein 5′ 269933 FLJ14816 TCCCCATTTCACCAAGG  593  0   7   7 0.01683  2PTPN18 protein tyrosine phos- 3′ 187 phatase, non-receptor typeGGCGAGGGGGCCTCTGG  594  2  13   4 0.02369  2 FLJ38377 hypotheticalprotein 3′ 593 FLJ38377 AGACCATCCTTGGACCC  595  3  41   9 6 ×  2 B3GALT1UDP-Gal: betaGlcNAc beta 5′ 524869 10⁻⁶ GGCGCCAGAGGAAGATC  596  8  30  2 0.01991  2 SSB autoantigen La 5′ 29950 TGTAAGGCGGCGGGGAG  597 18  55  2 0.00496  2 SP3 Sp3 transcription factor 3′ 1637 AAATTCCATAGACAACC 598  0  14  14 0.00122  2 HOXD4 homeo box D4 3′ 1141 ATGGTGTCGCTGGACAG 599  0  14  14 0.00122  2 ARPC2 actin related protein 5′ 94 2/3 complexsubunit 2 ATGGTGTCGCTGGACAG  600  0  14  14 0.00122  2 IL8RA interleukin8 receptor 5′ 50063 alpha TCACATTTCAGTTTGGG  601  4  24   4 0.00251  2COL4A4 alpha 4 type IV collagen 3′ 339 precursor ACTGCATCCGGCCTCGG  60210  48   3 0.00028  2 PTMA prothymosin, alpha 5′ 93674 (gene sequence28) CACCCGCGGTGCCGGGC  603 13  40   2 0.02012  2 PTMA prothymosin, alpha3′ 2352 
(gene sequence 28) GGGTCTTCATCTGATCC  604  6  25   3 0.01087  2FLJ43879 FLJ43879 protein 5′ 109293 GGGTGGGGGGTGCAGGC  605  0  17  170.00068  2 FLJ22671 hypothetical protein 5′ 144084 FLJ22671CAGCCGACTCTCTGGCT  606  0  35  35 1 ×  3 DTYMK deoxythymidylate kinase5′ 2784474 10⁻⁶ (thymidylate kinase) CCTAGCATCTCCTCTTG  607  0   7   70.01683  3 GRM7 glutamate receptor, 5′ 70 metabotropic 7 isoform bCTATACTGGCTCGTCCT  608  0  13  13 0.00244  3 SLC6A11 solute carrierfamily 6 5′ 108592 (neurotransmitter CTATACTGGCTCGTCCT  609  0  13  130.00244  3 ATP2B2 plasma membrane calcium 5′ 257778 ATPase 2 isoform bGAGGACTGGGGGCTGGG  610  0  10  10 0.03148  3 HRH1 histamine receptor H15′ 98409 GGAGGCAAACGGGAACC  611  5  19   3 0.03849  3 IQSEC1 IQ motifand Sec7 domain 5′ 315433 1 CCCGACGGGCGGCGCGG  612  0   7   7 0.01683  3DLEC1 deleted in lung and eso- 5′ 9380 phageal cancer 1 isoformCCCGACGGGCGGCGCGG  613  0   7   7 0.01683  3 PLCD1 phospholipase C,delta 1 5′ 200 GATCGCTGGGGTTTTGG  614  5  38   5 0.00013  3 DLEC1deleted in lung and eso- 5′ 9380 phageal cancer 1 isoformGATCGCTGGGGTTTTGG  615  5  38   5 0.00013  3 PLCD1 phospholipase C,delta 1 5′ 200 CGGCGCGTCCCTGCCGG  616 61 140   2 0.00079  3DKFZp313N0621 hypothetical protein 5′ 339665 DKFZp313N0621CCACTTCCCCATTGGTC  617 37 132   2 0  3 ARMET arginine-rich, mutated 5′633 in early stage tumors CACACCCCGCCCCCAGC  618 24  74   2 0.00071  3ACTR8 actin-related protein 8 3′ 338 AACCCCGAAACTGGAAG  619  2  19   60.00296  3 FAM19A4 family with sequence 5′ 143 similarity 19 (chemokine)GAAGAGTCCCAGCCGGT  620  0  52  52 0  3 MDS010 x 010 protein 5′ 5211GAAGAGTCCCAGCCGGT  621  0  52  52 0  3 TMEM39A tranamembrane protein 5′116 39A CAACCCCAACCGCGTTC  622  7  56   5 1 ×  3 MUC13 mucin 13,epithelial 5′ 120784 10⁻⁶ transmembrane CCTGCCTCTGGCAGGGG  623 16 100  4 0  3 PLXNA1 plexin A1 5′ 5386 GCGTTGGGCACCCCTGC  624  0   7   70.01683  3 Not Found GCCTAGAAGAAGCCGAA  625  8  50   4 2.9 ×  3 RAB43RAB41 protein 5′ 577 10⁻⁵ GGGCCGAGTCCGGCAGC  626  6  32   
4 0.00258  3CHST2 carbohydrate (N- 3′ 61 acetylglucosamine-6-O) GAAAGGGCAGTCCCGCC 627  0  18  18 0.00185  3 ZIC1 zinc finger protein of 5′ 155 thecerebellum 1 GAAAGGGCAGTCCCGCC  628  0  18  18 0.00185  3 ZIC4 zincfinger protein of 5′ 2618 the cerebellum 4 CTCGGTGGCGGGACCGG  629  8  26  2 0.02912  3 SCHIP1 schwannomin interacting 3′ 490368 protein 1GCCGGGCCGGTGACTCC  630  2  41  14 2 ×  3 FLJ22595 hypothetical protein5′ 111198 10⁻⁶ FLJ22595 GCCGGGCCGGTGACTCC  631  2  41  14 2 ×  3 KPNA4karyopherin alpha 4 5′ 372 10⁻⁶ CCCAGAGACTTTATCCT  632  0   9   90.00623  3 FNDC3B fibronectin type III 5′ 856 domain containing 3BCCCAGAGACTTTATCCT  633  0   9   9 0.00623  3 PLD1 phospholipase D1, 5′301657 phophatidylcholine- specific CGTGTGAGCTCTCCTGC  634 15 105   5 0 3 EPHB3 ephrin receptor EphB3 3′ 576 precursor TCTCAACACGCTAGGCA  635 3  22   5 0.00215  3 Not Found GGTACCTGCATCCTCTC  636  0  10  100.03148  3 HES1 hairy and enhancer of 5′ 1004 split 1 GGAAGCGCCCTGCCCTC 637  0  18  18 0.00035  4 Not Found CACTTCCCAGCTCTGAG  638  2  17   60.0052  4 FGFR3 fibroblast growth factor 5′ 26779 receptor 3 isoform 1CACCTCTGCCGTGCTGC  639  0  45  45 0  4 RNF4 ring finger protein 4 5′ 176CACCTCTGCCGTGCTGC  640  0  45  45 0  4 ZFYVE28 zinc finger, FYVE domain5′ 50261 containing 28 GGGCGGTGGCGGGGACG  641  0  12  12 0.00296  4RGS12 regulator of G-protein 5′ 21007 signalling 12 isoform 2GCTCTGGGCGCCCTTTC  642  7  52   5 6 ×  4 RGS12 regulator of G-protein 5′21007 10⁻⁶ signalling 12 isoform 2 CCTGCGCCGGGGGAGGC  643 39 119   2 1.1×  4 ADRA2C alpha-2C-adrenergic 3′ 432 10⁻⁵ receptor TACAATGAAGGGGTCAG 644  4  22   4 0.00554  4 STK32B serine/threonine kinase 5′ 28 32BTACAATGAAGGGGTCAG  645  4  22   4 0.00554  4 CYTL1 cytokine-like 1 5′32301 GCATTGATTGCTGTCCC  646  0   9   9 0.00623  4 MAIN2B2 mannosidase,alpha, 5′ 11294 class 2B, member 2 GCATTGATTGCTGTCCC  647  0   9   90.00623  4 PPP2R2C gamma isoform of regul- 5′ 91597 atory subunit B55,protein GTCCGTGGAATAGAAGG  648  0  18  18 0.00185  4 Not 
FoundACGCCGGCGCCGCTCGC  649  0   7   7 0.01683  4 FLJ13197 hypotheticalprotein 3′ 1219 FLJ13197 AAAGCACAGGCTCTCCC  650  2  14   5 0.0165  4SLC4A4 solute carrier family 4, 5′ 151833 sodium bicarbonateCCGCGGATCTCGCCGGT  651  5  24   3 0.00765  4 ASAHL N-acylsphingosineamido- 3′ 67 hydrolase-like protein AGCCACCTGCGCCTGGC  652 12  52   30.00033  4 PAQR3 progestin and adipoQ 5′ 101 receptor family member IIICAAGGGTTCACATATGC  653  0   8   8 0.01239  4 WDFY3 WD repeat and FYVEdo- 3′ 249 main containing 3 isoform CGCTTCGGGGTGCATCT  654  0  12  120.00296  4 PDHA2 pyruvate dehydrogenase 5′ 290397 (lipoamide) alpha 2CGCTTCGGGGTGCATCT  655  0  12  12 0.00296  4 UNC5C unc5C 5′ 683CCGGGCAGCCTCAGAGG  656  2  15   5 0.01234  4 FABP2 intestinal fatty acid5′ 132509 binding protein 2 GCTGTCCGCACGCGGCC  657  0  10  10 0.03148  4SMAD1 Sma- and Mad-related 5′ 301087 protein 1 GCTGTCCGCACGCGGCC  658  0 10  10 0.03148  4 HSHIN1 HIV-1 induced protein 5′ 5967 HIN-1 isoform 1TGCACGCACACTCTTCC  659  3  15   3 0.0273  4 LOC152485 hypotheticalprotein 3′ 851 LOC152485 GTGGGGAGGCTGGGGCG  660  3  20   4 0.00474  4DCAMKL2 doublecortin and CaM 5′ 1633428 kinase-like 2 GTGGGGAGGCTGGGGCG 661  3  20   4 0.00474  4 NR3C2 nuclear receptor sub- 5′ 3189 family 3,group C, member 2 TTTTTCATCTTCCCCCC  662  2  20   7 0.0023  4 GLRBglycine receptor, beta 5′ 64 TTTTTCATCTTCCCCCC  663  2  20   7 0.0023  4PDGFC platelet-derived growth 5′ 104727 factor C precursorCTTAGATCTAGCGTTCC  664  3  28   6 0.00034  4 DKFZP564J102 DKFZP564J102protein 5′ 4 TAACGCTCCCGGGCCTC  665  4  27   4 0.00113  5 Not FoundTCTGCACGCCGGGGTCT  666  7  24   2 0.02576  5 POLS polymerase (DNA 5′23056 directed) sigma GGAGGTCTCAGGATCCC  667  7  24   2 0.02576  5FLJ20152 hypothetical protein 5′ 108193 FLJ20152 CCCACTTTCAAAGGGGG  66840  97   2 0.00318  5 FST follistatin isoform 5′ 517 FST344 precursorCCCACTTTCAAAGGGGG  669 40  97   2 0.00318  5 MOCS2 molybdopterinsypthase 5′ 370479 large subunit MOCS2B ACCCGGGCCGCAGCGGC  670 20  95  3 0  5 EFNA5 
ephrin-A5 3′ 1019 CTGGGTTGCGATTAGCT  671  0  19  190.00146  5 PPIC peptidylprolyl isomerase 5′ 62181 C ACACATTTATTTTTCAG 672  0  14  14 0.00122  5 KIAA1961 KIAA1961 protein isoform 3′ 146 1GTGGGAGTCAAAGAGCT  673 10  55   4 2.8 ×  5 APXL2 apical protein 2 5′4006 10⁻⁵ CCGCTGGTGCACTCCGG  674 13  37   2 0.04341  5 TCF7transcription factor 7 3′ 252 (T-cell specific GTTTCTTCCCGCCCATC  675  0 25  25 0.00012  5 PHF15 PHD finger protein 15 3′ 1577 TCGCCGGGCGCTTGCCC  90 16  76   3 3 ×  5 PITX1 paired-like homeodomain 3′ 6163 10⁻⁶transcription factor 1 CTGACCGCGCTCGCCCC   91  8  28   2 0.03562  5PACAP proapoptotic caspase 5′ 4496 adaptor protein CCAGAGGGTCTTAAGTG 678  6  33   4 0.00184  5 NR3C1 nuclear receptor sub- 3′ 553 family 3,group C, member 1 ACCCACCAACACACGCC  679  4  21   3 0.00732  5 RANBP17RAN binding protein 17 3′ 402 CGTCTCCCATCCCGGGC  680  0  24  24 0.00007 5 CPLX2 complexin 2 3′ 1498 GCAGCAGCCTGTAATCC  681  0  11  11 0.00359 5 ZNF346 zinc finger rotein 346 3′ 167 GCCTGGCTTCCCCCCAG  682 21 135  4 0  5 PRR7 proline rich 7 3′ 7903 (synaptic) CGCCAGAGCTCTTTGTG  68310  38   3 0.00645  5 HNRPH1 heterogeneous nuclear 3′ 442ribonucleoprotein H1 GTTTCACGTCTCTGAGT  684  0   8   8 0.01239  5 BTNL9butyrophilin-like 9 3′ 12750 CTTTAGGTCGCAGGACA  685  0  14  14 0.00122 6 FOXF2 forkhead box F2 5′ 6373 TCAATGCTCCGGCGGGG  686  4  65  11 0  6TFAP2A transcription factor 5′ 4264 AP-2 alpha GGTCTCCGAAGCGAGCG  687  9 47   3 0.00018  6 MDGA1 MAM domain containing 3′ 934 GTGAAAGCATACCGTCA 688  0   8   8 0.01239  6 TFEB transcription factor EB 3′ 726GCTCTCACACAATAGGA  689  0   8   8 0.01239  6 DSCR1L1 Down syndromecritical 5′ 165679 region gene 1-like 1 AAGGAGACCGCACAGGG  690  7  45  4 6.9 ×  6 HTR1E 5-hydroxytryptamine 5′ 97 10⁻⁵ (serotonin) receptor1E AAGGAGACCGCACAGGG  691  7  45   4 6.9 ×  6 SYNCRIP synaptotagminbinding, 5′ 1294285 10⁻⁵ cytoplasmic RNA GTTGGAAATGGTGCGAA  692  0  10 10 0.00467  6 MAP3K7 mitogen-activated pro- 5′ 24225 tein kinase kinasekinase 7 ATTGTCAGATCTGGAAT  
693  2  12   4 0.03293  6 MAP3K7mitogen-activated pro- 5′ 24225 tein kinase kinase kinase 7TCCATAGATTGACAAAG  694  2  20   7 0.0023  6 MARCKS myristoylatedalanine- 3′ 3067 rich protein kinase C TACAAGGCACTATGCTG  695  0  20  200.00085  6 MCMDC1 minichromosome mainte- 3′ 518 nance protein domainGAGAACGGCTCGGGCGC  696  4  42   7 1.1 ×  6 IBRDC1 IBR domain containing1 5′ 21103 10⁻⁵ GTTATGGCCAGAACTTG  697  3  47  10 1 ×  6 MOXD1monooxygenase, DBH-like 5′ 26536 10⁻⁶ 1 AACTTGAGAGCGATTTC  698  0  13 13 0.00244  6 RAB32 RAB32, member RAS 3′ 160 oncogene familyGCAGTGTTCTGCTTGGC  699  2  23   8 0.00081  6 SYNJ2 synaptojanin 2 5′ 124CAACCCACGGGCAGGTG  110 13  60   3 5.3 ×  6 TAGAP T-cell activation Rho5′ 123822 10⁻⁵ GTPase-activating protein GGCAGACAGGCCCTATC  701  0   7  7 0.01683  6 FGFR1OP FGFR1 oncogene partner 3′ 316 isoform aGCAAACGTCTAGTTATC  702  0  20  20 0.00024  7 LOC90637 hypotheticalprotein 5′ 49 LOC90637 ATGAGTCCATTTCCTCG  703  8  67   6 0  7 MGC10911hypothetical protein 5′ 96664 MGC10911 GGGGGGGAACCGGACCG  704  0  18  180.00185  7 ACTB beta actin 3′ 865 GGGGGTCTTTCCCCCTC  705  0  13  130.00244  7 FSCN1 fascin 1 3′ 1392 CATTTCCTCGGGTGTGA  706  2  16   50.00705  7 MPP6 membrane protein, 3′ 216 palmitoylated 6TATTTGCCAAGTTGTAC  113  0   8   8 0.01239  7 HOXA11 homeobox protein A113′ 622 ACAAAAATGATCGTTCT  708  3  20   4 0.00474  7 PLEKHA8 pleckstrinhomology do- 3′ 159 main containing, family A TCCGCCCTGCCCCGGGC  709  0 17  17 0.00068  7 ZNRF2 zinc finger/RING finger 3′ 94 2GGCTCTCCGTCTCTGCC  710  3  18   4 0.00867  7 CRHR2 corticotropinreleasing 3′ 521 hormone receptor 2 GAACGTGCGTTTGCTTT  711  0   9   90.00623  7 Not Found GTCCCCAGCACGCGGTC  712  5  33   4 0.00079  7 TBX20T-box transcription 5′ 607 factor TBX20 TGCCCTGGGCTGCCCGC  713  4  17  3 0.03271  7 TBX20 T-box transcription 5′ 4120 factor TBX20TGGCAAACCCATTCTTG  714  5  80  11 0  7 MRPS24 mitochondrial ribosomal 3′159 protein S24 GCCAGACTCCTGACTTG  715  5  50   7 2 ×  7 POLD2polymerase (DNA 3′ 11 10⁻⁶ 
directed), delta 2, regulatoryAACTTGGGGCTGACCGG  716  2  13   4 0.02369  7 AUTS2 autism susceptibility3′ 1095850 candidate 2 CCCAGTCTAGCCAAGGT  717  0  12  12 0.01257  7 NotFound CCCCGCCGCGCTGATTG  718  0   8  8 0.01239  7 GTF21 generaltranscription 3′ 1037 factor II, i isoform 1 CCTTCCGCCCGAGCGTC  719  0  7   7 0.01683  7 POR P450 (cytochrome) 5′ 39477 oxidoreductaseTAATCTCCCTAAATACC  720  0  14  14 0.00718  7 Not Found CACTAGACGTGCCTGAG 721  0  11  11 0.01852  7 DLX5 distal-less homeo box 5 3′ 3450TTTGGAGGAGTGGAGTT  722  4  28   5 0.00064  7 MYLC2PL myosin light chain2, 5′ 185120 precursor GGCGGCGGCCACTTCTG  723  0  12  12 0.01257  7SRPK2 SFRS protein kinase 2 3′ 120 isoform a TCTGAGTCGCCAGCGTC  724  3 31   7 0.00013  7 AASS aminoadipate- 5′ 171064 semialdehyde synthaseAGTATCAAAACGGCAGC  725  2  17   6 0.0052  7 Not Found CCGCGGCGCGCTCTCCC 726  0  11  11 0.01852  7 CUL1 cullin 1 5′ 351 TTATTTTTACAGCAAAC  727 0  10  10 0.00467  7 Not Found GAGCTGGCAAGCCTGGG  728  0   8   80.01239  7 ASB10 ankyrin repeat and SOCS 3′ 11480 box-containing proteinGATGCCACCAGGTTGTG  729  4  28   5 0.00064  7 HTR5A 5-hydroxytryptamine5′ 579 (serotonin) receptor 5A GATGCCACCAGGTTGTG  730  4  28   5 0.00064 7 PAXIP1L PAX transcription acti- 5′ 67372 vation domain interact- ingCGGACCACGCGTCCCTG  731  5   0  −8 0.02613  7 C7orf3 chromosome 7 open 5′154 reading frame 3 CGGACCACGCGTCCCTG  732  5   0  −8 0.02613  7 C7orf2limb region 1 protein 5′ 56421 GGGGCCTATTCACAGCC  733 13  61   3 3.8 × 8 TNKS tankyrase, TRF1-inter- 5′ 404285 10⁻⁵ acting ankyrin-relatedGGGGCCTATTCACAGCC  734 13  61   3 3.8 ×  8 PPP1R3B protein phosphatase1, 5′ 953 10⁻⁵ regulatory (inhibitor) CCAGACGCCGGCTCGGC  735  6  39   40.00023  8 ZDHHC2 rec 3′ 683 GCTTTTCAACCGTAGCG  736  0   8   8 0.01239 8 KCTD9 potassium channel 3′ 587 tetramerisation domainGTGACGATGGAGGAGCT  737  0  33  33 0.00001  8 DUSP4 dual specificityphos- 3′ 629 phatase 4 isoform 1 CACACACACACCCGGGC  738  2  14   50.0165  8 GPR124 G protein-coupled 3′ 114 
receptor 124 CCTCCTGTTCCTCTGCC 739  3  36   8 3.7 ×  8 RAB11FIP1 Rab coupling protein 3′ 230 10⁻⁵isoform 3 CCCTGTCCTAGTAACGC  740  0  12  12 0.01257  8 DDHD2 DDHD domaincontaining 2 3′ 541 CTCCTCCTTCTTTTGCG  741  4  37   6 7.3 ×  8 ADAM9 adisintegrin and 3′ 542 10⁻⁵ metalloproteinase domain 9 CTTCAATTTGGTGAGGG 742  2  12   4 0.03293  8 MYST3 MYST histone acetyl- 3′ 462 transferase(monocytic) CGAGGAAGTGACCCTCG  743  0   7   7 0.01683  8 CHD7chromodomain helicase 5′ 156 DNA binding protein 7 GCGGGGGCAGCAGACGC 744  5  21   3 0.01878  8 PRDM14 PR domain containing 14 3′ 768CACCAGTCTTCGCCCGC  745  0   7   7 0.01683  8 RDH10 retinol dehydrogenase10 5′ 204 CACCAGTCTTCGCCCGC  746  0   7   7 0.01683  8 RPL7 ribosomalprotein L7 5′ 1264 TAACTGTCCTTTCCGTA  747  4  19   3 0.01426  8 NotFound TGCCATTCTGGAGAGCT  748  0  15  15 0.00413  8 LOC157567hypothetical protein 5′ 57 LOC157567 TAATTCGAGCACTTTGA  749  0  13  130.00244  8 FLJ20366 hypothetical protein 5′ 1280 FLJ203666AATAGGTAACTCACAAA  750  0  28  28 6.6 ×  8 FLJ14129 hypothetical protein5′ 237 10⁻⁵ FLJ14129 AAGTTGGCCACCTCGGG  751  0  11  11 0.00359  8 SCRIBscribble isoform b 3′ 194 ACTGCCTTGCCCCCTCC  752  0  18  18 0.00185  8PLEC1 plectin 1 isoform 1 5′ 1296 CTTGCCTCTCATCCTTC  753 12  91   5 0  8Sharpin shank-interacting 3′ 328 protein-like 1 GGGGTAACTCTTGAGTC  754 0   7   7 0.01683  8 Sharpin shank-interacting 3′ 328 protein-like 1GCCTCAGCCCGCACCCG  755  0   8   8 0.01239  8 DGAT1 diacylglycerol O- 5′84 acyltransferase 1 GGCACGGGAGCTGCTCC  756  3  42   9 4 ×  8 ADCK5 aarFdomain containing 3′ 748 10⁻⁶ kinase 5 GCGCCAACCCGGGCTGC  757  4  29   50.00051  8 CPSF1 cleavage and polyadenyl- 5′ 318 ation specific factor 1GCACCTCAGGCGGCAGT  758  2  12   4 0.03293  8 KIFC2 kinesin family memberC2 5′ 153 GCACCTCAGGCGGCAGT  759  2  12   4 0.03293  8 CYHR1 cysteineand histidine 5′ 735 rich 1 GACCTACTGGATTGCTC  760  0  20  20 0.00085  9ANKRD15 ankyrin repeat domain 5′ 171831 protein 15 AAATGAAACTAGTCTTG 761  0  17  17 0.00238  9 ANKRD15 
ankyrin repeat domain 5′ 171831protein 15 TCTGTGTGCTGTGTGCG  762  3  17   4 0.01446  9 SMARCA2SWI/SNF-related matrix- 3′ 1580 associated CACAGCAGCCCGTCAGG  763  0   9  9 0.00623  9 TYRP1 tyrosinase-related 5′ 2080245 protein 1CACAGCAGCCCGTCAGG  764  0   9   9 0.00623  9 PTPRD protein tyrosinephos- 5′ 1594466 phatase, receptor type, D AGGGGGCTGCTCCGGAG  765  7  27  3 0.0099  9 MOBKL2B MOB1, Mps One Binder 3′ 1418 kinase activator-like2B GGGATACACACAGGGGA  766  2  12   4 0.03293  9 PAX5 paired box 5 3′48156 GTGCGGGCGACGGCAGC  767  3  34   8 7.8 ×  9 KLF9 Kruppel-likefactor 9 3′ 995 10⁻⁵ GGGTGCCGCGGCCACGA  768  6  24   3 0.01444  9 GNAQguanine nucleotide 3′ 302 binding protein (G protein) TAAATAGGCGAGAGGAG 769  6  34   4 0.00131  9 FLJ46321 FLJ46321 protein 5′ 299849TAAATAGGCGAGAGGAG  770  6  34   4 0.00131  9 TLE1 transducin-likeenhancer 5′ 241 protein 1 ATCGAGTGCGACGCCTG  771  0  15  15 0.00099  9PHF2 PHD finger protein 2 3′ 686 isoform b CCGCTTGCCCCGAAACC  772  0  10 10 0.03148  9 PTPN3 protein tyrosine phos- 5′ 316517 phatase,non-receptor type TCTTCTATTGCCTGATT  773  0  10  10 0.00467  9 SUSD1sushi domain containing 3′ 17 1 AAGTCAGTGCGCAAACG  774  0   8   80.01239  9 STOM stomatin isoform a 5′ 128954 GCGGGCGGCGCGGTCCC  775 44121   2 6.9 ×  9 LHX6 LIM homeobox protein 6 3′ 408 10⁻⁵ isoform 1ATTTGTGCAGCTACCGT  776  0   9   9 0.00623  9 Not Found AGGCAGGAGATGGTCTG 777  4  21   3 0.00732  9 PRDM12 PR domain containing 12 5′ 5017GGCGTTAATAGAGAGGC  778  0  13  13 0.00244  9 PRDM12 PR domain containing12 5′ 5017 AGGTTGTTGTTCTTGCA  779  5  29   4 0.00133  9 PRDM12 PR domaincontaining 12 3′ 1427 AGCCCTGGGCTCTCTCT  780  0   7   7 0.01683  9C9orf67 chromosome 9 open read- 5′ 11874 ing frame 67 AGCCCTGGGCTCTCTCT 781  0   7   7 0.01683  9 C9orf59 chromosome 9 open read- 5′ 1343 ingframe 59 CTCCTTTTGAGCCCCTG  782  0   8   8 0.01239  9 C9orf67 chromosome9 open read- 5′ 11874 ing frame 67 CTCCTTTTGAGCCCCTG  783  0   8   80.01239  9 C9orf59 chromosome 9 open read- 5′ 1343 ing 
frame 59CTCCCAGTACAGGAGCC  784 12  45   2 0.00281  9 RAPGEF1 guanine nucleotide-5′ 2333 releasing factor 2 isoform a TACGCGGGTGGGGGAGA  785  8  31   30.01478  9 ADAMTS13 a disintegrin-like and 3′ 6658 metalloproteaseCAGGGCCCTGGGTGCTG  786  0   8   8 0.01239  9 OLFM1 olfactomedin relatedER 3′ 74 localized protein AAGGAGCCTACGTTAAT  787  0  10  10 0.00467  9UBADC1 ubiquitin associated 3′ 10 domain containing 1 GAGGACAGCCGGCTCGT 788  0   7   7 0.01683  9 LHX3 LIM homeobox protein 3 3′ 4193 isoform bCAGCCAGCTTTCTGCCC  139 16  91   4 0  9 LHX3 LIM homeobox protein 3 5′146 isoform b TTTTCCCGAGGCCAGAG  790 11  33   2 0.04578  9 EGFL7EGF-like-domain, 3′ 2912 multiple 7 AAGAGCAAATAAGAGGC  791  0   7   70.01683 10 KIAA0934 KIAA0934 3′ 138 AGCCACCGTACAAGGCC  792 12  40   20.01181 10 PFKP phosphofructokinase, 3′ 1056 platelet CCCCAGGCCTCGGCCAG 793  0   7   7 0.01683 10 ANKRD16 ankyrin repeat domain 16 5′ 375isoform a CTCAGAGGAGGGGCAGA  794  0  11  11 0.00359 10 ANKRD16 ankyrinrepeat domain 16 5′ 375 isoform a AAAATAGAGGTTCCTCC  795  0  30  30 2.8× 10 PRPF18 PRP18 pre-mRNA process- 5′ 58621 10⁻⁵ ing factor 18 homologAAAATAGAGGTTCCTCC  796  0  30  30 2.8 × 10 C10orf30 chromosome 10 open5′ 25417 10⁻⁵ reading frame 30 ACCTCGAAGCCGCCAAG  797  0   7   7 0.0168310 ZNF32 zinc finger protein 32 5′ 101 AATGAACGACCAGACCC  798 10  56   40.00002 10 DDX21 DEAD (Asp-Glu-Ala-Asp) 3′ 506 box polypeptide 21GGTCGCTCCTCGTTGGG  799  0  10  10 0.00467 10 C10orf13 hypotheticalprotein 3′ 771 MGC39320 GAGTTTCTTTAGTAAAG  800  0  10  10 0.00467 10GPR120 G protein-coupled 3′ 255 receptor 120 AGTTAGTTCCCAACTCA  801  0 10  10 0.00467 10 MLR2 ligand-dependent 5′ 84 corepressorAGTTAGTTCCCAACTCA  802  0  10  10 0.00467 10 PIK3AP1 phosphoinositide-3-5′ 112373 kinase adaptor protein 1 GGGACAGGTGGCAGGCC  803 19  64   20.00074 10 PAX2 paired box protein 2 5′ 6126 isoform b GAGCTAATCAATAGGCA 804  0  10  10 0.00467 10 PAX2 paired box protein 2 5′ 6126 isoform bTGGGAAAGGTCTTGTGG  805 10  36   2 0.01161 10 LZTS2 
leucine zipper, putative tumor suppressor 2 | 3′ | 2691
GCGGCCGCGGGCAGGGG | 806 | 0 | 7 | 7 | 0.01683 | 10 | TRIM8 | tripartite motif-containing 8 | 5′ | 375
CTGCCCGCAGGTGGCGC | 807 | 9 | 42 | 3 | 0.00094 | 10 | CNNM2 | cyclin M2 isoform 1 | 3′ | 212
GAGGTAGTGCCCTGTCC | 808 | 3 | 16 | 4 | 0.01997 | 10 | SH3MD1 | SH3 multiple domains 1 | 3′ | 24
TTGTGTGTACATAGGGC | 809 | 0 | 11 | 11 | 0.00359 | 10 | SORCS1 | SORCS receptor 1 isoform a | 5′ | 1301646
GCTCATTGCGTCCCGCT | 810 | 8 | 33 | 3 | 0.00804 | 10 | KIAA1598 | KIAA1598 | 3′ | 509
AGCAGCAGCCCCATCCC | 811 | 12 | 42 | 2 | 0.00672 | 10 | EMX2 | empty spiracles homolog 2 | 5′ | 166361
AGCAGCAGCCCCATCCC | 811 | 12 | 42 | 2 | 0.00672 | 10 | PDZK8 | PDZ domain containing 8 | 5′ | 657
GGGCCCCGCCCAGCCAG | 813 | 0 | 18 | 18 | 0.00185 | 10 | C10orf137 | erythroid differentiation-related factor 1 | 5′ | 556810
GGGCCCCGCCCAGCCAG | 814 | 0 | 18 | 18 | 0.00185 | 10 | CTBP2 | C-terminal binding protein 2 isoform 1 | 5′ | 2249
TGCGCTTGGCAGCCGGG | 815 | 0 | 8 | 8 | 0.01239 | 10 | ADAM12 | a disintegrin and metalloprotease domain 12 | 3′ | 464
TCAGAGGCTGATGGGGC | 816 | 7 | 31 | 3 | 0.00755 | 10 | MGMT | O-6-methylguanine-DNA methyltransferase | 5′ | 1340765
TCAGAGGCTGATGGGGC | 817 | 7 | 31 | 3 | 0.00755 | 10 | MKI67 | antigen identified by monoclonal antibody Ki-67 | 5′ | 232
TGGAGGCAGGTGCACAG | 818 | 0 | 12 | 12 | 0.01257 | 10 | CYP2E1 | cytochrome P450, family 2, subfamily E | 3′ | 826
CAGCCGAAGTGGCGCTC | 819 | 0 | 13 | 13 | 0.00244 | 11 | NALP6 | NACHT, leucine rich repeat and PYD containing 6 | 3′ | 1950
GCCTGGCACTGGGTCCA | 820 | 0 | 12 | 12 | 0.01257 | 11 | C11orf13 | HRAS1-related cluster-1 | 5′ | 374
GCCTGGCACTGGGTCCA | 821 | 0 | 12 | 12 | 0.01257 | 11 | MGC35138 | hypothetical protein MGC35138 | 5′ | 297
GAAAACTCCAGATAGTG | 822 | 6 | 21 | 2 | 0.03859 | 11 | ASCL2 | achaete-scute complex homolog-like 2 | 3′ | 582
CTTTGAAATAAGCGAAT | 823 | 0 | 7 | 7 | 0.01683 | 11 | PDE3B | phosphodiesterase 3B, cGMP-inhibited | 3′ | 526
GCGCTGCCCTATATTGG | 824 | 3 | 22 | 5 | 0.00215 | 11 | FLJ11336 | hypothetical protein FLJ11336 | 3′ | 375
TCTAGGACCTCCAGGCC | 825 | 12 | 69 | 4 | 1 × 10⁻⁶ | 11 | SLC39A13 | solute carrier family 39 (zinc transporter) | 5′ | 415
TCTAGGACCTCCAGGCC | 826 | 12 | 69 | 4 | 1 × 10⁻⁶ | 11 | SPI1 | spleen focus forming virus (SFFV) proviral | 5′ | 29668
CCCTGCCCTTAGTGCTT | 827 | 0 | 10 | 10 | 0.03148 | 11 | Not Found
CTCTGGGCTGTGAGGAC | 828 | 0 | 12 | 12 | 0.00296 | 11 | C11ORF4 | chromosome 11 hypothetical protein ORF4 | 5′ | 458
CTCTGGGCTGTGAGGAC | 829 | 0 | 12 | 12 | 0.00296 | 11 | BAD | BCL2-antagonist of cell death protein | 5′ | 708
CGCCCCTTCCCTGCGCC | 830 | 0 | 15 | 15 | 0.00413 | 11 | FBXL11 | F-box and leucine-rich repeat protein 11 | 5′ | 454
CCACAGACCAGTGGGTG | 831 | 0 | 14 | 14 | 0.00718 | 11 | TPCN2 | two pore segment channel 2 | 3′ | 305
GCCCTGCATACAACCCT | 832 | 6 | 26 | 3 | 0.00682 | 11 | Not Found
GCTCAGAGGCGCTGGAA | 833 | 3 | 21 | 5 | 0.0037 | 11 | ZBTB16 | zinc finger and BTB domain containing 16 | 3′ | 913
CCCCGGCAGGCGGCGGC | 834 | 8 | 35 | 3 | 0.0043 | 11 | ROBO3 | roundabout, axon guidance receptor, homolog 3 | 5′ | 64774
CCCCGGCAGGCGGCGGC | 835 | 8 | 35 | 3 | 0.0043 | 11 | FLJ23342 | hypothetical protein FLJ23342 | 5′ | 208
GATTATGAAAGCCCATC | 836 | 0 | 17 | 17 | 0.00068 | 11 | BARX2 | BarH-like homeobox 2 | 5′ | 2434
GATTATGAAAGCCCATC | 837 | 0 | 17 | 17 | 0.00068 | 11 | RICS | Rho GTPase-activating protein | 5′ | 349388
CGACATATCAGGGATCA | 838 | 0 | 8 | 8 | 0.01239 | 11 | APLP2 | amyloid beta (A4) precursor-like protein 2 | 5′ | 589
CTCCAGCCCTGTGTCCT | 839 | 0 | 13 | 13 | 0.00923 | 12 | M160 | scavenger receptor cysteine-rich type 1 protein | 3′ | 3750
CCTGCCGGTGGAGGGCA | 840 | 12 | 44 | 2 | 0.00377 | 12 | ST8SIA1 | ST8 alpha-N-acetyl-neuraminide | 5′ | 176
CCACGTCTTAGCACTCT | 841 | 2 | 19 | 6 | 0.00296 | 12 | DDX11 | DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11 | 5′ | 277542
CCACGTCTTAGCACTCT | 842 | 2 | 19 | 6 | 0.00296 | 12 | C1QDC1 | C1q domain containing 1 isoform 2 | 5′ | 41819
GCTGCCCCAAGTGGTCT | 180 | 4 | 33 | 5 | 0.00031 | 12 | Not Found
GCGGCCTCAGGTGAGCG | 844 | 2 | 13 | 4 | 0.02369 | 12 | EIF4B | eukaryotic translation initiation factor 4B | 3′ | 587
TCCCCACCCCTGGTACC | 845 | 0 | 7 | 7 | 0.01683 | 12 | LOC56901 | NADH ubiquinone oxidoreductase MLRQ subunit | 5′ | 1764
TCTCCGTGTATGTGCGC | 846 | 3 | 20 | 4 | 0.00474 | 12 | HMGA2 | high mobility group AT-hook 2 | 3′ | 1476
TTGACAGGCAGACAAGT | 847 | 0 | 9 | 9 | 0.00623 | 12 | ATP2B1 | plasma membrane calcium ATPase 1 isoform 1b | 5′ | 52908
CCTTCCTCCCCACGCAG | 848 | 2 | 16 | 5 | 0.00705 | 12 | NFYB | nuclear transcription factor Y, beta | 5′ | 197
TTGCAAAGAACGGAGCC | 849 | 0 | 9 | 9 | 0.00623 | 12 | CUTL2 | cut-like 2 | 3′ | 265
TCAAGTGTGAGGGGAAG | 850 | 2 | 22 | 7 | 0.00104 | 12 | PBP | prostatic binding protein | 5′ | 32016
TCAAGTGTGAGGGGAAG | 851 | 2 | 22 | 7 | 0.00104 | 12 | FLJ20674 | hypothetical protein FLJ20674 | 5′ | 104
ACAAAGTACCGTGGTTC | 852 | 0 | 16 | 16 | 0.0031 | 12 | TSP-NY | testis-specific protein TSP-NY isoform a | 3′ | 81
GAGGCCAGATTTTCTCC | 853 | 2 | 46 | 15 | 0 | 12 | HIP1R | huntingtin interacting protein-1-related | 5′ | 170
AAGGCTGGGAGTTTTCT | 854 | 4 | 22 | 4 | 0.00554 | 12 | ABCB9 | ATP-binding cassette, sub-family B (MDR/TAP) | 3′ | 517
GGGCGGCCGGCGGGGGC | 855 | 10 | 0 | −15 | 0.00558 | 12 | Not Found
CGAACTTCCCGGTTCCG | 856 | 21 | 96 | 3 | 0 | 12 | Not Found
CAGCGGCCAAAGCTGCC | 857 | 16 | 69 | 3 | 2.5 × 10⁻⁵ | 12 | RAN | ras-related nuclear protein | 5′ | 257
CAGCGGCCAAAGCTGCC | 858 | 16 | 69 | 3 | 2.5 × 10⁻⁵ | 12 | EPIM | epimorphin isoform 2 | 5′ | 32499
CGCAGGCTACCAGTGCA | 859 | 2 | 12 | 4 | 0.03293 | 12 | PUS1 | pseudouridylate synthase 1 | 5′ | 740
CACTGCCTGATGGTGTG | 860 | 18 | 107 | 4 | 0 | 13 | IL17D | interleukin 17D precursor | 3′ | 277
AAGGTCTCTACCGCGCC | 861 | 0 | 13 | 13 | 0.00244 | 13 | WDFY2 | WD repeat- and FYVE domain-containing protein 2 | 5′ | 130880
AAGGTCTCTACCGCGCC | 862 | 0 | 13 | 13 | 0.00244 | 13 | DDX26 | DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 26 | 5′ | 629
TTTGCTACGTGTACATC | 863 | 0 | 14 | 14 | 0.00122 | 13 | RANBP5 | RAN binding protein 5 | 3′ | 23155
CCACCAGCCTCCCTCGG | 864 | 8 | 79 | 7 | 0 | 13 | DOCK9 | dedicator of cytokinesis 9 | 5′ | 1277
CAGTGGCCTCCATCTGG | 865 | 7 | 26 | 2 | 0.01495 | 13 | KDELC1 | KDEL (Lys-Asp-Glu-Leu) containing 1 | 3′ | 141
GGTTCGAAGGGCAGCGG | 866 | 4 | 46 | 8 | 3 × 10⁻⁶ | 14 | PPM1A | protein phosphatase 1A isoform 1 | 3′ | 733
AGCTCTGCCAGTAGTTG | 867 | 5 | 32 | 4 | 0.00112 | 14 | MTHFD1 | methylenetetrahydrofolate dehydrogenase 1 | 5′ | 49925
AGCTCTGCCAGTAGTTG | 868 | 5 | 32 | 4 | 0.00112 | 14 | ESR2 | estrogen receptor 2 | 5′ | 44089
TGCCCAGCCCTCAGCAC | 869 | 0 | 11 | 11 | 0.00359 | 14 | SFRS5 | splicing factor, arginine/serine-rich 5 | 5′ | 40145
CCTCTAGGACCAAGCCT | 870 | 2 | 24 | 8 | 0.00064 | 14 | SLC8A3 | solute carrier family 8 member 3 isoform B | 3′ | 270
GAGTCGCAGTATTTTGG | 871 | 6 | 31 | 3 | 0.0036 | 14 | GTF2A1 | TFIIA alpha, p55 isoform 1 | 3′ | 181
CGGCGCAGCTCCAGGTC | 872 | 21 | 55 | 2 | 0.01977 | 14 | KCNK10 | potassium channel, subfamily K, member 10 | 3′ | 3468
GCCTTCAGGTTGCGGGT | 873 | 0 | 16 | 16 | 0.00081 | 14 | BCL11B | B-cell CLL/lymphoma 11B isoform 2 | 3′ | 25026
GCCCCACGCCCCCTGGC | 874 | 8 | 50 | 4 | 2.9 × 10⁻⁵ | 14 | C14orf153 | chromosome 14 open reading frame 153 | 5′ | 681
GCCCCACGCCCCCTGGC | 875 | 8 | 50 | 4 | 2.9 × 10⁻⁵ | 14 | BAG5 | BCL2-associated athanogene 5 | 5′ | 19
GAGGCCAGCCTGAGGGC | 876 | 0 | 7 | 7 | 0.01683 | 14 | C14orf151 | chromosome 14 open reading frame 151 | 5′ | 39104
GAGGCCAGCCTGAGGGC | 877 | 0 | 7 | 7 | 0.01683 | 14 | FLJ42486 | FLJ42486 protein | 5′ | 45756
TTCCAGTGGCAAGTTGA | 878 | 12 | 43 | 2 | 0.00504 | 14 | CDCA4 | cell division cycle associated 4 | 3′ | 550
TCGAGCCGCGCGGTCGT | 879 | 0 | 8 | 8 | 0.01239 | 15 | KLF13 | Kruppel-like factor 13 | 3′ | 1607
GCTCTGCCCCCGTGGCC | 880 | 6 | 58 | 6 | 0 | 15 | BAHD1 | bromo adjacent homology domain containing 1 | 5′ | 138
GCAGAGGCTGAGCGGCC | 881 | 0 | 8 | 8 | 0.01239 | 15 | C15orf21 | D-PCa-2 protein isoform c | 3′ | 11782
GCCGCCCCCCGACCGAA | 882 | 0 | 8 | 8 | 0.01239 | 15 | ONECUT1 | one cut domain, family member 1 | 3′ | 4340
TTTCTCCTGATGGAGTC | 883 | 0 | 12 | 12 | 0.00296 | 15 | DAPK2 | death-associated protein kinase 2 | 5′ | 207
TCAGGCTTCCCCTTCGG | 884 | 7 | 27 | 3 | 0.0099 | 15 | PIAS1 | protein inhibitor of activated STAT, 1 | 5′ | 190450
GCCCCAACCGGTCCTTC | 885 | 9 | 29 | 2 | 0.04715 | 15 | PKM2 | pyruvate kinase 3 isoform 1 | 3′ | 300
GACCCCACAAGGGCTTG | 886 | 3 | 41 | 9 | 6 × 10⁻⁶ | 15 | LOC92912 | hypothetical protein LOC92912 | 5′ | 119
CCTTGAGAGCAGAGAGC | 887 | 4 | 31 | 5 | 0.00032 | 15 | LRRN6A | leucine-rich repeat neuronal 6A | 3′ | 43
TGGGGACTGATGCACCC | 888 | 6 | 30 | 3 | 0.00501 | 15 | CIB2 | DNA-dependent protein kinase catalytic | 3′ | 598
CACGTGAGGGGGTGGTA | 889 | 4 | 32 | 5 | 0.00045 | 15 | BLP2 | BBP-like protein 2 isoform a | 5′ | 22
CCCGCGGGAGAGACCGG | 890 | 3 | 28 | 6 | 0.00034 | 16 | E4F1 | p120E4F | 5′ | 8954
CCCGCGGGAGAGACCGG | 891 | 3 | 28 | 6 | 0.00034 | 16 | MGC21830 | hypothetical protein MGC21830 | 5′ | 3623
CCGGGTCCGCGGGCGAG | 892 | 13 | 40 | 2 | 0.02012 | 16 | USP7 | ubiquitin specific protease 7 (herpes | 3′ | 725
ATCCGGCCAAGCCCTAG | 893 | 6 | 37 | 4 | 0.00047 | 16 | ATF7IP2 | activating transcription factor 7 interacting | 5′ | 244550
ATCCGGCCAAGCCCTAG | 894 | 6 | 37 | 4 | 0.00047 | 16 | GRIN2A | N-methyl-D-aspartate receptor subunit 2A | 5′ | 809
TTCCTACCCCCTACACC | 895 | 2 | 20 | 7 | 0.0023 | 16 | TXNDC11 | thioredoxin domain containing 11 | 3′ | 238
GAGGGAGCTTGACATTC | 896 | 5 | 40 | 5 | 6.5 × 10⁻⁵ | 16 | LOC146174 | hypothetical protein LOC146174 | 3′ | 214
GCCTATAGGGTCCTGGG | 897 | 2 | 12 | 4 | 0.03293 | 16 | HS3ST2 | heparan sulfate D-glucosaminyl | 3′ | 227
GGGTAGGCACAGCCGTC | 898 | 3 | 27 | 6 | 0.00044 | 16 | TBX6 | T-box 6 isoform 1 | 5′ | 85
TGCGCGCGTCGGTGGCG | 899 | 6 | 22 | 2 | 0.02566 | 16 | LOC51333 | mesenchymal stem cell protein DSC43 | 3′ | 9832
AACTATCCAGGGACCTG | 900 | 2 | 14 | 5 | 0.0165 | 16 | FLJ38101 | hypothetical protein FLJ38101 | 5′ | 167223
AACTATCCAGGGACCTG | 901 | 2 | 14 | 5 | 0.0165 | 16 | ZNF423 | zinc finger protein 423 | 5′ | 31051
GTTGGGGAAGGCACCGC | 902 | 6 | 34 | 4 | 0.00131 | 16 | FLJ38101 | hypothetical protein FLJ38101 | 5′ | 167223
GTTGGGGAAGGCACCGC | 903 | 6 | 34 | 4 | 0.00131 | 16 | ZNF423 | zinc finger protein 423 | 5′ | 31051
ACAATAGCGCGATCGAG | 904 | 3 | 20 | 4 | 0.00474 | 16 | IRX5 | iroquois homeobox protein 5 | 5′ | 455
ACAATAGCGCGATCGAG | 904 | 3 | 20 | 4 | 0.00474 | 16 | IRX3 | iroquois homeobox protein 3 | 5′ | 644277
GGGCGCGCCGCGCCGCG | 906 | 7 | 0 | −11 | 0.00579 | 16 | IRX5 | iroquois homeobox protein 5 | 5′ | 455
GGGCGCGCCGCGCCGCG | 907 | 7 | 0 | −11 | 0.00579 | 16 | IRX3 | iroquois homeobox protein 3 | 5′ | 644277
CGATTCGAAGGGAGGGG | 908 | 0 | 41 | 41 | 1 × 10⁻⁶ | 16 | IRX6 | iroquois homeobox protein 6 | 5′ | 386305
GTGCAGTCTCGGCCCGG | 909 | 6 | 35 | 4 | 0.00093 | 16 | FBXL8 | F-box and leucine-rich repeat protein 8 | 3′ | 3905
GGGATCCTCTTGCAAAG | 910 | 4 | 21 | 3 | 0.00732 | 16 | DNCL2B | dynein, cytoplasmic, light polypeptide 2B | 5′ | 939218
GGGATCCTCTTGCAAAG | 911 | 4 | 21 | 3 | 0.00732 | 16 | MAF | v-maf musculoaponeurotic fibrosarcoma oncogene | 5′ | 1024
AGCCACCACACCCTTCC | 912 | 8 | 32 | 3 | 0.01092 | 16 | EFCBP2 | neuronal calcium-binding protein 2 | 3′ | 36
AACACCCTCAGCCAGCC | 913 | 0 | 9 | 9 | 0.00623 | 17 | MNT | MAX binding protein | 3′ | 8124
CCGTGTTGTCCTGCCCG | 914 | 4 | 28 | 5 | 0.00064 | 17 | MNT | MAX binding protein | 3′ | 228
CAAAGCCACACAGTTTA | 915 | 0 | 8 | 8 | 0.01239 | 17 | MGC2941 | hypothetical protein MGC2941 | 3′ | 1256
GCGGAGCCCAGTCCCGA | 916 | 0 | 17 | 17 | 0.00238 | 17 | MGC2941 | hypothetical protein MGC2941 | 3′ | 1256
CCACACCTCTCTCCAGG | 917 | 0 | 16 | 16 | 0.00081 | 17 | SENP3 | SUMO1/sentrin/SMT3 specific protease 3 | 5′ | 326
TGGGAGTCACGTCCTCA | 918 | 0 | 13 | 13 | 0.00244 | 17 | FLJ20014 | hypothetical protein FLJ20014 | 3′ | 948
CGCTTTTGACACATTGG | 919 | 9 | 42 | 3 | 0.00094 | 17 | NDEL1 | nudE nuclear distribution gene E homolog like 1 | 3′ | 550
GCTGCCGCCGGCGCAGC | 920 | 3 | 26 | 6 | 0.00077 | 17 | GLP2R | glucagon-like peptide 2 receptor precursor | 5′ | 181348
CTGGTCTGCGGCCTCCG | 921 | 0 | 20 | 20 | 0.00024 | 17 | LOC116236 | hypothetical protein LOC116236 | 3′ | 155
GCCGCGCACAGGCCGGT | 922 | 3 | 28 | 6 | 0.00034 | 17 | NF1 | neurofibromin | 3′ | 603
CACCAGAAACCTCGGGG | 923 | 4 | 23 | 4 | 0.00427 | 17 | DUSP14 | dual specificity phosphatase 14 | 5′ | 198
CCAAGGAACCTGAAAAC | 924 | 0 | 9 | 9 | 0.00623 | 17 | ACLY | ATP citrate lyase isoform 1 | 3′ | 446
CCTACCTATCCCTGGAC | 925 | 7 | 49 | 5 | 1.7 × 10⁻⁵ | 17 | STAT5A | signal transducer and activator of transcription | 3′ | 1085
GCTATGGGTCGGGGGAG | 215 | 49 | 140 | 2 | 6 × 10⁻⁶ | 17 | SOST | sclerostin precursor | 3′ | 3140
GATGCTCGAACGCAGAG | 927 | 0 | 10 | 10 | 0.00467 | 17 | SOST | sclerostin precursor | 3′ | 3140
GAGGCTGGCACCCAGGC | 928 | 0 | 22 | 22 | 0.00016 | 17 | C1QL1 | complement component 1, q subcomponent-like 1 | 3′ | 8471
AACACGCTGGCTCTTGC | 929 | 0 | 12 | 12 | 0.00296 | 17 | CRHR1 | corticotropin releasing hormone receptor 1 | 3′ | 1129
GAGCTGATCACCATTCT | 930 | 0 | 9 | 9 | 0.00623 | 17 | KPNB1 | karyopherin beta 1 | 3′ | 758
TGTGTCTGCGTAGAAAT | 931 | 0 | 7 | 7 | 0.01683 | 17 | HOXB9 | homeo box B9 | 3′ | 455
GTCCTGCGGGGCGAGAG | 932 | 3 | 22 | 5 | 0.00215 | 17 | NME2 | nucleoside-diphosphate kinase 2 | 5′ | 163
CATTTCCTGGGCTATTT | 933 | 0 | 7 | 7 | 0.01683 | 17 | MRC2 | mannose receptor, C type 2 | 3′ | 527
CCCCTGCCCTGTCACCC | 226 | 0 | 48 | 48 | 0 | 17 | SLC9A3R1 | solute carrier family 9 (sodium/hydrogen | 3′ | 11941
CTGCCCGGCAGCCAGCC | 935 | 0 | 7 | 7 | 0.01683 | 17 | CBX2 | chromobox homolog 2 isoform 2 | 5′ | 361
TTGACTCGCCGCTTCCC | 936 | 0 | 8 | 8 | 0.01239 | 17 | CBX8 | chromobox homolog 8 | 5′ | 620
CCCCAGGCCGGGTGTCC | 303 | 10 | 65 | 4 | 1 × 10⁻⁶ | 17 | CBX8 | chromobox homolog 8 | 5′ | 16730
CCTCTTCCCAGACCGAA | 938 | 0 | 18 | 18 | 0.00185 | 17 | CBX4 | chromobox homolog 4 | 5′ | 1307
ACCCGCACCATCCCGGG | 229 | 88 | 201 | 2 | 4.1 × 10⁻⁵ | 17 | CBX4 | chromobox homolog 4 | 5′ | 4600
TCCCTCATTCGCCCCGG | 940 | 18 | 79 | 3 | 4 × 10⁻⁶ | 18 | EMILIN2 | elastin microfibril interfacer 2 | 3′ | 143
CACACGCACGGGAGCGC | 941 | 0 | 8 | 8 | 0.01239 | 18 | ZFP161 | zinc finger protein 161 homolog | 5′ | 2780
TGAAGAAAAGGCCTTTG | 942 | 0 | 7 | 7 | 0.01683 | 18 | ACAA2 | acetyl-coenzyme A acyltransferase 2 | 5′ | 380776
GAACTATCTTCTACCAA | 943 | 2 | 21 | 7 | 0.00133 | 18 | RNF152 | ring finger protein 152 | 5′ | 1155
CGCATAAGGGGTGTGGC | 944 | 0 | 7 | 7 | 0.01683 | 18 | FBXO15 | F-box protein 15 | 3′ | 23
GAGAATAAATTACTGGG | 945 | 0 | 7 | 7 | 0.01683 | 18 | ZNF236 | zinc finger protein 236 | 5′ | 1649
TCCGGAGTTGGGACCTC | 946 | 2 | 22 | 7 | 0.00104 | 19 | Not Found
CTCCGGCTTCAGTGGCC | 947 | 3 | 20 | 4 | 0.00474 | 19 | C19orf24 | chromosome 19 open reading frame 24 | 3′ | 156
AACGGGATCCGCACGGG | 948 | 3 | 21 | 5 | 0.0037 | 19 | APC2 | adenomatosis polyposis coli 2 | 3′ | 18214
GCCATCTCTTCGGGCGC | 949 | 6 | 0 | −9 | 0.00911 | 19 | KLF16 | BTE-binding protein 4 | 3′ | 2472
ACAGTAGCGCCCCCTCT | 950 | 0 | 13 | 13 | 0.00244 | 19 | MGC17791 | hypothetical protein MGC17791 | 5′ | 57795
ACAGTAGCGCCCCCTCT | 951 | 0 | 13 | 13 | 0.00244 | 19 | SEMA6B | semaphorin 6B isoform 1 precursor | 5′ | 23231
CTCCGAGGCGGCCACCC | 952 | 0 | 9 | 9 | 0.00623 | 19 | ARHGEF18 | Rho-specific guanine nucleotide exchange factor | 5′ | 106295
CTCCGAGGCGGCCACCC | 953 | 0 | 9 | 9 | 0.00623 | 19 | INSR | insulin receptor | 5′ | 559
CCCTCTGCAAGCACCAC | 954 | 0 | 9 | 9 | 0.00623 | 19 | FLJ23420 | hypothetical protein FLJ23420 | 5′ | 19155
ATCGTAGCTCGCTGCAG | 955 | 0 | 10 | 10 | 0.03148 | 19 | FLJ23420 | hypothetical protein FLJ23420 | 5′ | 75
AAGGACGGGAGGGAGAA | 956 | 0 | 8 | 8 | 0.01239 | 19 | LASS4 | LAG1 longevity assurance homolog 4 | 5′ | 60310
AAGGACGGGAGGGAGAA | 957 | 0 | 8 | 8 | 0.01239 | 19 | FBN3 | fibrillin 3 precursor | 5′ | 1561
CAGACTTTAGTTTTGAA | 958 | 0 | 11 | 11 | 0.01852 | 19 | UBL5 | ubiquitin-like 5 | 5′ | 197
CAGACTTTAGTTTTGAA | 959 | 0 | 11 | 11 | 0.01852 | 19 | FBXL12 | F-box and leucine-rich repeat protein 12 | 5′ | 8685
GTCGTTCAGGGGCGTCT | 960 | 0 | 14 | 14 | 0.00122 | 19 | LOC90580 | hypothetical protein BC011833 | 3′ | 349
GCTCCAGCGATGATTGT | 961 | 0 | 11 | 11 | 0.01852 | 19 | ELAVL3 | ELAV-like protein 3 isoform 1 | 3′ | 923
ACCCTCGCGTGGGCCCC | 962 | 13 | 42 | 2 | 0.01177 | 19 | ZNF136 | zinc finger protein 136 (clone pHZ-20) | 5′ | 89
ACCCTCGCGTGGGCCCC | 963 | 13 | 42 | 2 | 0.01177 | 19 | ZNF625 | zinc finger protein 625 | 5′ | 6300
CCTCCCGCCCGGCCCGG | 964 | 2 | 13 | 4 | 0.02369 | 19 | SAMD1 | sterile alpha motif domain containing 1 | 5′ | 889
AGCCTGCAAAGGGGAGG | 965 | 0 | 50 | 50 | 0 | 19 | AKAP8L | A kinase (PRKA) anchor protein 8-like | 5′ | 13794
CAGAGGGAATAACCAGT | 966 | 0 | 12 | 12 | 0.01257 | 19 | KIAA1533 | KIAA1533 | 3′ | 119
ACCTCAAGCACGCGGTC | 967 | 0 | 8 | 8 | 0.01239 | 19 | KIAA1533 | KIAA1533 | 3′ | 576
TGATTGTGTGTGAGGCT | 968 | 0 | 16 | 16 | 0.0031 | 19 | Not Found
ACGAGCACACTGAAAAG | 969 | 6 | 44 | 5 | 0.00004 | 19 | AKT2 | v-akt murine thymoma viral oncogene homolog 2 | 3′ | 451
TTGGGTTCGCTCAGCGG | 970 | 6 | 30 | 3 | 0.00501 | 19 | ASE-1 | CD3-epsilon-associated protein; antisense to | 5′ | 1320
TTGGGTTCGCTCAGCGG | 971 | 6 | 30 | 3 | 0.00501 | 19 | PPP1R13L | protein phosphatase 1, regulatory (inhibitor) | 5′ | 11721
CGTGGGAAACCTCGATG | 972 | 0 | 23 | 23 | 8.5 × 10⁻⁵ | 19 | ASE-1 | CD3-epsilon-associated protein; antisense to | 5′ | 1320
CGTGGGAAACCTCGATG | 973 | 0 | 23 | 23 | 8.5 × 10⁻⁵ | 19 | PPP1R13L | protein phosphatase 1, regulatory (inhibitor) | 5′ | 11721
AGACTAAACCCCCGAGG | 974 | 7 | 64 | 6 | 0 | 19 | ASE-1 | CD3-epsilon-associated protein; antisense to | 3′ | 824
CTGGTGGGGAAGGTGGC | 975 | 2 | 20 | 7 | 0.0023 | 19 | SIX5 | sine oculis homeobox homolog 5 | 3′ | 1102
TACAGCTGCTGCAGCGC | 976 | 2 | 12 | 4 | 0.03293 | 19 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D | 3′ | 48538
GTTTATTCCAAACACTG | 977 | 0 | 10 | 10 | 0.00467 | 19 | GRIN2D | N-methyl-D-aspartate receptor subunit 2D | 3′ | 48538
CTCACGACGCCGTGAAG | 978 | 33 | 96 | 2 | 0.00021 | 20 | SOX12 | SRY (sex determining region Y)-box 12 | 3′ | 123
TCAGCCCAGCGGTATCC | 979 | 2 | 21 | 7 | 0.00133 | 20 | RRBP1 | ribosome binding protein 1 | 3′ | 270
GTTTACCCTCTGTCTCC | 980 | 7 | 56 | 5 | 1 × 10⁻⁶ | 20 | RIN2 | RAB5 interacting protein 2 | 5′ | 130452
GAAAAGACTGCCCTCTG | 981 | 0 | 7 | 7 | 0.01683 | 20 | ZNF336 | zinc finger protein 336 | 5′ | 2846
GACAACGCGGGGAAGGA | 982 | 0 | 10 | 10 | 0.00467 | 20 | NAPB | N-ethylmaleimide-sensitive factor attachment | 3′ | 859
GCAAGGGGCAGAGAAAG | 983 | 0 | 8 | 8 | 0.01239 | 20 | PDRG1 | p53 and DNA damage-regulated protein | 3′ | 23
GCTGAGAGCTGCGGGTG | 984 | 0 | 11 | 11 | 0.00359 | 20 | TSPYL3 | TSPY-like 3 | 3′ | 38
AGCAACTTTCCTGGGTC | 985 | 6 | 32 | 4 | 0.00258 | 20 | PLAGL2 | pleiomorphic adenoma gene-like 2 | 3′ | 179
CGCTCCCACGTCCGGGA | 986 | 0 | 16 | 16 | 0.00081 | 20 | SNTA1 | acidic alpha 1 syntrophin | 3′ | 288
CTTTCAAACTGGACCCG | 987 | 0 | 28 | 28 | 6.6 × 10⁻⁵ | 20 | Not Found
CGCGCAGCTCGCTGAGG | 988 | 2 | 21 | 7 | 0.00133 | 20 | Not Found
GGATAGGGGTGGCCGGG | 989 | 0 | 24 | 24 | 0.00015 | 20 | MATN4 | matrilin 4 isoform 1 precursor | 3′ | 11782
CGCAACCCTGGCGACGC | 990 | 0 | 13 | 13 | 0.00244 | 20 | CDH22 | cadherin 22 precursor | 5′ | 56203
GGGAATAGGGGGGCGGG | 991 | 15 | 73 | 3 | 3 × 10⁻⁶ | 20 | CDH22 | cadherin 22 precursor | 5′ | 56203
GGGGATTCTACCCTGGG | 992 | 10 | 54 | 4 | 3.9 × 10⁻⁵ | 20 | ARFGEF2 | ADP-ribosylation factor guanine | 5′ | 93944
GGGGATTCTACCCTGGG | 993 | 10 | 54 | 4 | 3.9 × 10⁻⁵ | 20 | PREX1 | PREX1 protein | 5′ | 62
CCTGCGCCGCCGCCCGG | 994 | 8 | 29 | 2 | 0.0267 | 20 | CEBPB | CCAAT/enhancer binding protein beta | 3′ | 446
ATCCCCGAGCTGCTGGA | 995 | 7 | 30 | 3 | 0.01035 | 20 | TMEPAI | transmembrane prostate androgen-induced protein | 3′ | 277
TCCAGAGGCCCGAGCTC | 996 | 8 | 26 | 2 | 0.02912 | 20 | PPP1R3D | protein phosphatase 1, regulatory subunit 3D | 3′ | 627
AAGCGGGGAGGCTGAGG | 997 | 0 | 19 | 19 | 0.00029 | 20 | OSBPL2 | oxysterol-binding protein-like protein 2 isoform | 3′ | 254
TGTCACAGACTCCCAGC | 998 | 8 | 38 | 3 | 0.00165 | 21 | USP25 | ubiquitin specific protease 25 | 5′ | 664846
TGTCACAGACTCCCAGC | 999 | 8 | 38 | 3 | 0.00165 | 21 | NRIP1 | receptor interacting protein 140 | 5′ | 96802
GAAATGTGGCCAGTGCA | 1000 | 0 | 7 | 7 | 0.01683 | 21 | SIM2 | single-minded homolog 2 long isoform | 3′ | 48171
AGTCCTTGCTGGGGTCC | 1001 | 0 | 18 | 18 | 0.00185 | 21 | PKNOX1 | PBX/knotted 1 homeobox 1 isoform 1 | 3′ | 384
ACCCTGAAAGCCTAGCC | 266 | 8 | 59 | 5 | 1 × 10⁻⁶ | 21 | ITGB2 | integrin beta chain, beta 2 precursor | 5′ | 10805
AATGGAACTGACCACTG | 1003 | 9 | 36 | 3 | 0.00621 | 22 | TUBA8 | tubulin, alpha 8 | 5′ | 44
GGGGGCCTGCAGGGTGG | 1004 | 34 | 105 | 2 | 3.3 × 10⁻⁵ | 22 | ARVCF | armadillo repeat protein | 3′ | 720
CCCACCAGGCACGTGGC | 1005 | 19 | 50 | 2 | 0.02718 | 22 | NPTXR | neuronal pentraxin receptor isoform 1 | 5′ | 376
GTGGCCGTGGACCCTGA | 1006 | 5 | 23 | 3 | 0.00997 | 22 | ATF4 | activating transcription factor 4 | 5′ | 850
GCCTCAGCATCCTCCTC | 1007 | 2 | 30 | 10 | 8.6 × 10⁻⁵ | 22 | FLJ27365 | FLJ27365 protein | 5′ | 24574
GCCTCAGCATCCTCCTC | 1008 | 2 | 30 | 10 | 8.6 × 10⁻⁵ | 22 | FLJ10945 | hypothetical protein FLJ10945 | 5′ | 7284
GCCCTGGGGTGTTATGG | 1009 | 2 | 26 | 9 | 0.00029 | 22 | FLJ27365 | FLJ27365 protein | 5′ | 13829
GCCCTGGGGTGTTATGG | 1010 | 2 | 26 | 9 | 0.00029 | 22 | FLJ10945 | hypothetical protein FLJ10945 | 5′ | 18029
AAGAGCCAGGCCACGGG | 1011 | 2 | 14 | 5 | 0.0165 | 22 | FLJ41993 | FLJ41993 protein | 5′ | 2751
GTTTCGAAATGAGCTCC | 1012 | 0 | 12 | 12 | 0.00296 | 23 | GPM6B | glycoprotein M6B isoform 1 | 3′ | 267
GAGATGCGCCTACGCCC | 1013 | 11 | 65 | 4 | 2 × 10⁻⁶ | 23 | NHS | Nance-Horan syndrome protein | 3′ | 274
TAGTTCACTATCGCTTC | 1014 | 4 | 19 | 3 | 0.01426 | 23 | SH3KBP1 | SH3-domain kinase binding protein 1 | 3′ | 346
GGTCTCCTGAGGACCAG | 1015 | 4 | 19 | 3 | 0.01426 | 23 | Not Found
ACTCATCCCTGAAGAGT | 1016 | 0 | 10 | 10 | 0.00467 | 23 | DDX3X | DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 3 | 5′ | 246
CCTCAGATCAGGATGGG | 1017 | 2 | 20 | 7 | 0.0023 | 23 | NYX | nyctalopin | 5′ | 4793
GTCTGGTCGATGTTGCG | 1018 | 4 | 25 | 4 | 0.00186 | 23 | MID2 | midline 2 isoform 1 | 5′ | 50400
GTCTGGTCGATGTTGCG | 1019 | 4 | 25 | 4 | 0.00186 | 23 | DSIPI | delta sleep inducing peptide, immunoreactor | 5′ | 42
TAGTACTTTCAGGTAGG | 1020 | 0 | 9 | 9 | 0.00623 | 23 | UBE2A | ubiquitin-conjugating enzyme E2A isoform 2 | 3′ | 285
ATTTACACGGGGCTCAC | 1021 | 0 | 10 | 10 | 0.03148 | 23 | STAG2 | stromal antigen 2 | 5′ | 1402
GGGGCGAAGAAAGCAGA | 1022 | 3 | 26 | 6 | 0.00077 | 23 | STAG2 | stromal antigen 2 | 5′ | 1402
ATCCTGTCCCTGGCCTC | 1023 | 0 | 9 | 9 | 0.00623 | 23 | SLC6A8 | solute carrier family 6 (neurotransmitter | 3′ | 89
GCGGCAGCGGCGCCGGC | 1024 | 11 | 0 | −17 | 0.00314 | 23 | CXorf12 | chromosome X open reading frame 12 | 5′ | 745
GCGGCAGCGGCGCCGGC | 1025 | 11 | 0 | −17 | 0.00314 | 23 | HCFC1 | host cell factor C1 (VP16-accessory protein) | 5′ | 7318
GAAGCAAGAGTTTGGCC | 1026 | 2 | 62 | 21 | 0 | 23 | FLNA | filamin 1 (actin-binding protein-280) | 3′ | 3103

The column headings are as in Table 2 except that the MSDK libraries compared are the N-STR-I7 and I-STR-7 MSDK libraries (see Table 3 for details of the tissues from which these libraries were made).

TABLE 8 MSDK tags significantly (p < 0.050) differentially present in N-STR-I17 and I-STR-17 MSDK libraries and genes associated with the MSDK tags.

MSDK Tag | SEQ ID NO. | N-STR-I17 | I-STR-17 | Ratio I-STR-17/N-STR-I17 | P value | Chr | Gene | Description | Position of AscI site in relation to tr. Start | Distance of AscI site from tr. Start (bp)
AAGCTGCTGCGGCGGGC | 1027 | 5 | 0 | −7 | 0.0254984 | 1 | B3GALT6 | UDP-Gal:betaGal beta 1,3-galactosyltransferase | 3′ | 335
GCGCGGGAAGGGGTGGA | 1028 | 0 | 8 | 8 | 0.0316311 | 1 | SPEN | spen homolog, trans-regulator | 5′ | 11971
GTGGTCTTCAGAGGTAG | 1029 | 0 | 8 | 8 | 0.0316311 | 1 | TAL1 | T-cell acute lymphocytic leukemia 1 | 5′ | 2571
TCCGAACTTCCGGACCC | 1030 | 2 | 15 | 5 | 0.0037833 | 1 | Not Found
GCCCAACCCCGGGGAGT | 1031 | 0 | 6 | 6 | 0.0179052 | 1 | P66beta | transcription repressor p66 beta component of | 5′ | 117605
TCTGGGGCCGGGTAGCC | 1032 | 28 | 53 | 1 | 0.0231777 | 1 | P66beta | transcription repressor p66 beta component of | 5′ | 117605
GCAGCGGCGCTCCGGGC | 1033 | 20 | 48 | 2 | 0.0034829 | 1 | MUC1 | mucin 1, transmembrane | 3′ | 139119
CTCTCACCCGAGGAGCG | 1034 | 0 | 9 | 9 | 0.0203814 | 2 | OACT2 | O-acyltransferase (membrane bound) domain | 3′ | 47
GCAGCATTGCGGCTCCG | 1035 | 25 | 58 | 2 | 0.0016016 | 2 | SIX2 | sine oculis homeobox homolog 2 | 5′ | 160394
TCATTGCATACTGAAGG | 1036 | 0 | 5 | 5 | 0.0308794 | 2 | SLC1A4 | solute carrier family 1, member 4 | 5′ | 335302
TCATTGCATACTGAAGG | 1037 | 0 | 5 | 5 | 0.0308794 | 2 | SERTAD2 | SERTA domain containing 2 | 5′ | 245
CCCCAGCTCGGCGGCGG | 1038 | 20 | 53 | 2 | 0.0006521 | 2 | TCF7L1 | HMG-box transcription factor TCF-3 | 3′ | 859
AAGCAGTCTTCGAGGGG | 1039 | 0 | 8 | 8 | 0.0072167 | 2 | CNNM3 | cyclin M3 isoform 1 | 5′ | 396
CCCCCACCCCCCAGCCC | 1040 | 4 | 17 | 3 | 0.0100324 | 2 | TLK1 | tousled-like kinase 1 | 5′ | 221
TGTAAGGCGGCGGGGAG | 1041 | 3 | 15 | 4 | 0.0093236 | 2 | SP3 | Sp3 transcription factor | 3′ | 1637
ACTGCATCCGGCCTCGG | 1042 | 25 | 9 | −4 | 0.0116348 | 2 | PTMA | prothymosin, alpha (gene sequence 28) | 5′ | 93674
GGAGGCAAACGGGAACC | 1043 | 0 | 8 | 8 | 0.0316311 | 3 | IQSEC1 | IQ motif and Sec7 domain 1 | 5′ | 315433
CGGCGCGTCCCTGCCGG | 1044 | 21 | 44 | 2 | 0.0186262 | 3 | DKFZp313N0621 | hypothetical protein DKFZp313N0621 | 5′ | 339665
CCACTTCCCCATTGGTC | 1045 | 35 | 68 | 1 | 0.0057244 | 3 | ARMET | arginine-rich, mutated in early stage tumors | 5′ | 633
CCTGCCTCTGGCAGGGG | 1046 | 9 | 31 | 3 | 0.0025605 | 3 | PLXNA1 | plexin A1 | 5′ | 5386
CTCGGTGGCGGGACCGG | 1047 | 7 | 20 | 2 | 0.0253353 | 3 | SCHIP1 | schwannomin interacting protein 1 | 3′ | 490368
CGTGTGAGCTCTCCTGC | 1048 | 17 | 40 | 2 | 0.0105223 | 3 | EPHB3 | ephrin receptor EphB3 precursor | 3′ | 576
CCTGCGCCGGGGGAGGC | 1049 | 37 | 94 | 2 | 0.0000051 | 4 | ADRA2C | alpha-2C-adrenergic receptor | 3′ | 432
AAAGCACAGGCTCTCCC | 1050 | 0 | 5 | 5 | 0.0308794 | 4 | SLC4A4 | solute carrier family 4, sodium bicarbonate | 5′ | 151833
TGCGGAGAAGACCCGGG | 1051 | 0 | 11 | 11 | 0.0056118 | 4 | ELOVL6 | ELOVL family member 6, elongation of long chain | 3′ | 1583
GGAGGTCTCAGGATCCC | 1052 | 0 | 14 | 14 | 0.0007408 | 5 | FLJ20152 | hypothetical protein FLJ20152 | 5′ | 108193
GCAGGCTGCAGGTTCCG | 1053 | 2 | 11 | 4 | 0.0248947 | 5 | RAI14 | retinoic acid induced 14 | 5′ | 411295
GCAGGCTGCAGGTTCCG | 1054 | 2 | 11 | 4 | 0.0248947 | 5 | C1QTNF3 | C1q and tumor necrosis factor related protein 3 | 5′ | 201285
CCCACTTTCAAAGGGGG | 1055 | 0 | 13 | 13 | 0.0008961 | 5 | FST | follistatin isoform FST344 precursor | 5′ | 517
CCCACTTTCAAAGGGGG | 1056 | 0 | 13 | 13 | 0.0008961 | 5 | MOCS2 | molybdopterin synthase large subunit MOCS2B | 5′ | 370479
CCGCTGGTGCACTCCGG | 1057 | 2 | 13 | 5 | 0.0080417 | 5 | TCF7 | transcription factor 7 (T-cell specific | 3′ | 252
CGTCTCCCATCCCGGGC | 1058 | 13 | 43 | 2 | 0.0003622 | 5 | CPLX2 | complexin 2 | 3′ | 1498
GCTGCGGCCCTCCGGGG | 1059 | 2 | 10 | 4 | 0.0363689 | 6 | ITPR3 | inositol 1,4,5-triphosphate receptor, type 3 | 5′ | 179
GCTGCGGCCCTCCGGGG | 1060 | 2 | 10 | 4 | 0.0363689 | 6 | FLJ43752 | FLJ43752 protein | 5′ | 28049
GGTCTCCGAAGCGAGCG | 1061 | 0 | 6 | 6 | 0.0179052 | 6 | MDGA1 | MAM domain containing | 3′ | 934
GCAGCCGCTTCGGCGCC | 1062 | 16 | 36 | 2 | 0.023022 | 6 | EGFL9 | EGF-like-domain, multiple 9 | 3′ | 134
TCCATAGATTGACAAAG | 1063 | 12 | 3 | −5 | 0.0358865 | 6 | MARCKS | myristoylated alanine-rich protein kinase C | 3′ | 3067
GCGAGGGCCCAGGGGTC | 1064 | 15 | 48 | 2 | 0.0001996 | 7 | SLC29A4 | solute carrier family 29 (nucleoside | 3′ | 67
GTCCCCAGCACGCGGTC | 1065 | 2 | 15 | 5 | 0.0037833 | 7 | TBX20 | T-box transcription factor TBX20 | 5′ | 607
AACTTGGGGCTGACCGG | 1066 | 7 | 29 | 3 | 0.0007208 | 7 | AUTS2 | autism susceptibility candidate 2 | 3′ | 1095850
GGACGCGCTGAGTGGTG | 1067 | 0 | 6 | 6 | 0.0179052 | 7 | KIAA1862 | KIAA1862 protein | 5′ | 148
GGACGCGCTGAGTGGTG | 1068 | 0 | 6 | 6 | 0.0179052 | 7 | FLJ12700 | hypothetical protein FLJ12700 | 5′ | 90181
TAATTCGAGCACTTTGA | 1069 | 0 | 5 | 5 | 0.0308794 | 8 | FLJ20366 | hypothetical protein FLJ20366 | 5′ | 1280
AAGAGGCAGAACGTGCG | 1070 | 37 | 70 | 1 | 0.006975 | 8 | KCNK9 | potassium channel, subfamily K, member 9 | 3′ | 360
AGAGGAGCAGGAAGCGA | 1071 | 0 | 6 | 6 | 0.0179052 | 9 | PAX5 | paired box 5 | 3′ | 48156
TAAATAGGCGAGAGGAG | 1072 | 6 | 18 | 2 | 0.0274955 | 9 | FLJ46321 | FLJ46321 protein | 5′ | 299849
TAAATAGGCGAGAGGAG | 1073 | 6 | 18 | 2 | 0.0274955 | 9 | TLE1 | transducin-like enhancer protein 1 | 5′ | 241
ATCGAGTGCGACGCCTG | 1074 | 4 | 14 | 3 | 0.0337426 | 9 | PHF2 | PHD finger protein 2 isoform b | 3′ | 686
GGCGTTAATAGAGAGGC | 1075 | 0 | 5 | 5 | 0.0308794 | 9 | PRDM12 | PR domain containing 12 | 5′ | 5017
CTCCCAGTACAGGAGCC | 1076 | 0 | 12 | 12 | 0.0036439 | 9 | RAPGEF1 | guanine nucleotide-releasing factor 2 isoform a | 5′ | 2333
GAGGACAGCCGGCTCGT | 1077 | 6 | 0 | −8 | 0.0154516 | 9 | LHX3 | LIM homeobox protein 3 isoform b | 3′ | 4193
CAGCCAGCTTTCTGCCC | 139 | 7 | 22 | 2 | 0.0114719 | 9 | LHX3 | LIM homeobox protein 3 isoform b | 5′ | 146
AGCCACCGTACAAGGCC | 1079 | 0 | 11 | 11 | 0.0056118 | 10 | PFKP | phosphofructokinase, platelet | 3′ | 1056
TGACGGCAAAAGCCGCC | 1080 | 0 | 8 | 8 | 0.0316311 | 10 | EGR2 | early growth response 2 protein | 3′ | 1010
TGGGAAAGGTCTTGTGG | 1081 | 0 | 20 | 20 | 0.0000356 | 10 | LZTS2 | leucine zipper, putative tumor suppressor 2 | 3′ | 2691
CCCCGTGGCGGGAGCGG | 1082 | 15 | 38 | 2 | 0.0074135 | 10 | NEURL | neuralized-like | 5′ | 630
CCCCGTGGCGGGAGCGG | 1083 | 15 | 38 | 2 | 0.0074135 | 10 | FAM26A | family with sequence similarity 26, member A | 5′ | 14420
TTGTGTGTACATAGGCC | 1084 | 0 | 8 | 8 | 0.0316311 | 10 | SORCS1 | SORCS receptor 1 isoform a | 5′ | 1301646
CGGAGCCGCCCCAGGGG | 1085 | 5 | 0 | −7 | 0.0254984 | 11 | RNH | ribonuclease/angiogenin inhibitor | 3′ | 381
TCTAGGACCTCCAGGCC | 1086 | 11 | 32 | 2 | 0.0064141 | 11 | SLC39A13 | solute carrier family 39 (zinc transporter) | 5′ | 415
TCTAGGACCTCCAGGCC | 1087 | 11 | 32 | 2 | 0.0064141 | 11 | SPI1 | spleen focus forming virus (SFFV) proviral | 5′ | 29668
GAGGCCTCTGAGGAGCG | 1088 | 0 | 9 | 9 | 0.0203814 | 11 | OVOL1 | OVO-like 1 binding protein | 5′ | 452
GAGGCCTCTGAGGAGCG | 1089 | 0 | 9 | 9 | 0.0203814 | 11 | DKFZp761E198 | hypothetical protein DKFZp761E198 | 5′ | 6534
CGCCCCTTCCGTGCGCC | 1090 | 0 | 7 | 7 | 0.0100816 | 11 | FBXL11 | F-box and leucine-rich repeat protein 11 | 5′ | 454
TCGGAGTCCCCGTCTCC | 1091 | 0 | 5 | 5 | 0.0308794 | 12 | ANKRD33 | ankyrin repeat domain 33 | 5′ | 73619
GCCTGGACGGCCTCGGG | 1092 | 5 | 21 | 3 | 0.003569 | 12 | CSRP2 | cysteine and glycine-rich protein 2 | 3′ | 185
ACTGTCTCCGCGAAGAG | 1093 | 4 | 16 | 3 | 0.0139338 | 12 | CSRP2 | cysteine and glycine-rich protein 2 | 3′ | 185
CGAACTTCCCGGTTCCG | 1094 | 14 | 46 | 2 | 0.0002219 | 12 | Not Found
CAGCGGCCAAAGCTGCC | 1095 | 9 | 29 | 2 | 0.0029267 | 12 | RAN | ras-related nuclear protein | 5′ | 257
CAGCGGCCAAAGCTGCC | 1096 | 9 | 29 | 2 | 0.0029267 | 12 | EPIM | epimorphin isoform 2 | 5′ | 32499
TTTGCTACGTGTACATC | 1097 | 0 | 6 | 6 | 0.0179052 | 13 | RANBP5 | RAN binding protein 5 | 3′ | 23155
GCGGACGAGGCCCCGCG | 1098 | 0 | 5 | 5 | 0.0308794 | 13 | CUL4A | cullin 4A isoform 2 | 3′ | 322
CCCCCAAGACACATCAA | 1099 | 0 | 10 | 10 | 0.0018237 | 14 | C14orf87 | chromosome 14 open reading frame 87 | 5′ | 18535
CCCCCAAGACACATCAA | 1100 | 0 | 10 | 10 | 0.0018237 | 14 | C14orf49 | chromosome 14 open reading frame 49 | 5′ | 40614
GGCCGGTGCCGCCAGTC | 1101 | 6 | 18 | 2 | 0.0274955 | 14 | EML1 | echinoderm microtubule associated protein like 1 | 5′ | 62907
GAGGCCAGCCTGAGGGC | 1102 | 0 | 5 | 5 | 0.0308794 | 14 | C14orf151 | chromosome 14 open reading frame 151 | 5′ | 39104
GAGGCCAGCCTGAGGGC | 1103 | 0 | 5 | 5 | 0.0308794 | 14 | FLJ42486 | FLJ42486 protein | 5′ | 45756
ACACCTGTGTCACCTGG | 1104 | 0 | 10 | 10 | 0.013797 | 15 | OCA2 | P protein | 3′ | 2135
GCTCTGCCCCCGTGGCC | 1105 | 0 | 6 | 6 | 0.0179052 | 15 | BAHD1 | bromo adjacent homology domain containing 1 | 5′ | 138
CCCACCCCCACACCCCC | 1106 | 0 | 9 | 9 | 0.0203814 | 16 | CPNE2 | copine II | 5′ | 179
GCAGCCCCTTGGTGGAG | 1107 | 3 | 12 | 3 | 0.0408401 | 16 | TUBB3 | tubulin, beta, 4 | 3′ | 843
CCGTGTTGTCCTGCCCG | 1108 | 0 | 11 | 11 | 0.0013551 | 17 | MNT | MAX binding protein | 3′ | 228
AAGGTGAAGAAGGGCGG | 1109 | 6 | 18 | 2 | 0.0274955 | 17 | UNC119 | unc119 (C. elegans) homolog isoform a | 3′ | 355
GCCGCGCACAGGCCGGT | 1110 | 12 | 26 | 2 | 0.0499764 | 17 | NF1 | neurofibromin | 3′ | 603
CCTACCTATCCCTGGAC | 1111 | 5 | 21 | 3 | 0.003569 | 17 | STAT5A | signal transducer and activator of transcription | 3′ | 1085
GCCTGACCCTTTTCTGC | 1112 | 0 | 8 | 8 | 0.0316311 | 17 | CBX2 | chromobox homolog 2 isoform 2 | 5′ | 361
ACCCGCACCATCCCGGG | 229 | 15 | 41 | 2 | 0.0026364 | 17 | CBX4 | chromobox homolog 4 | 5′ | 4600
CGCTATATTGGACCGCA | 1114 | 0 | 8 | 8 | 0.0316311 | 18 | KCTD1 | potassium channel tetramerisation domain | 3′ | 90452
GCCCGCGGGGCTGTCCC | 1115 | 0 | 6 | 6 | 0.0179052 | 18 | GALR1 | galanin receptor 1 | 5′ | 146
GCCCGCGGGGCTGTCCC | 1116 | 0 | 6 | 6 | 0.0179052 | 18 | MBP | myelin basic protein | 5′ | 232612
TCTCGGCGCAAGCAGGC | 1117 | 0 | 7 | 7 | 0.0100816 | 18 | SALL3 | sal-like 3 | 3′ | 1008
GCGGGTCGGGCCGGGGC | 1118 | 0 | 6 | 6 | 0.0179052 | 18 | NFATC1 | nuclear factor of activated T-cells, cytosolic | 3′ | 4015
CTAGAAGGGGTCGGGGA | 1119 | 17 | 36 | 2 | 0.0356297 | 19 | CALM3 | calmodulin 3 | 5′ | 129594
CTAGAAGGGGTCGGGGA | 1120 | 17 | 36 | 2 | 0.0356297 | 19 | FLJ10781 | hypothetical protein FLJ10781 | 5′ | 140
GCGGCCGCTCGGCAGCC | 1121 | 0 | 9 | 9 | 0.0055033 | 19 | GLTSCR1 | glioma tumor suppressor candidate region gene 1 | 5′ | 70312
GCGGCCGCTCGGCAGCC | 1122 | 0 | 9 | 9 | 0.0055033 | 19 | ZNF541 | zinc finger protein 541 | 5′ | 63752
GCTGCGGCCGGCCGGGG | 1123 | 5 | 16 | 2 | 0.0283658 | 19 | UBE2S | ubiquitin carrier protein | 5′ | 478
TCAGCCCAGCGGTATCC | 1124 | 2 | 11 | 4 | 0.0248947 | 20 | RRBP1 | ribosome binding protein 1 | 3′ | 270
GGGGATTCTACCCTGGG | 1125 | 3 | 26 | 6 | 0.0001076 | 20 | ARFGEF2 | ADP-ribosylation factor guanine | 5′ | 93944
GGGGATTGTACCCTGGG | 1126 | 3 | 26 | 6 | 0.0001076 | 20 | PREX1 | PREX1 protein | 5′ | 62
CCTGCGCCGCCGCCCGG | 1127 | 7 | 32 | 3 | 0.0002443 | 20 | CEBPB | CCAAT/enhancer binding protein beta | 3′ | 446
CTGGCCGCCGTGCTGGC | 1128 | 0 | 9 | 9 | 0.0203814 | 20 | TAF4 | TBP-associated factor 4 | 3′ | 243
ACCCTGAAAGCCTAGCC | 266 | 4 | 16 | 3 | 0.0139338 | 21 | ITGB2 | integrin beta chain, beta 2 precursor | 5′ | 10805
CTGGACAGAGCCCTCGG | 1130 | 0 | 10 | 10 | 0.013797 | 22 | TCF20 | transcription factor 20 isoform 2 | 5′ | 128618
CTGCCTGCGGAGGCACA | 1131 | 0 | 5 | 5 | 0.0308794 | 22 | CELSR1 | cadherin EGF LAG seven-pass G-type receptor 1 | 5′ | 39397
AAGAGCCAGGCCACGGG | 1132 | 4 | 16 | 3 | 0.0139338 | 22 | FLJ41993 | FLJ41993 protein | 5′ | 2751
GCGGCCGAGGCGACAGC | 1133 | 0 | 5 | 5 | 0.0308794 | 22 | CHKB | choline/ethanolamine kinase isoform b | 3′ | 293
CGGGGTGCCGAGCCCCG | 1134 | 0 | 6 | 6 | 0.0179052 | 22 | ACR | acrosin precursor | 5′ | 63440
CGGGGTGCCGAGCCCCG | 1135 | 0 | 6 | 6 | 0.0179052 | 22 | ARSA | arylsulfatase A precursor | 5′ | 46630
TGCAAGATACGCGGGGC | 1136 | 0 | 6 | 6 | 0.0179052 | 23 | AMMECR1 | AMMECR1 protein | 3′ | 72

The column headings are as in Table 2 except that the MSDK libraries compared are the N-STR-I17 and I-STR-17 MSDK libraries (see Table 3 for details of the tissues from which the libraries were made).

The comparison of myoepithelial cells isolated from normal breast tissue to those isolated from ductal carcinoma in situ (DCIS) revealed some dramatic differences and indicated relative hypermethylation of the DCIS myoepithelial cells (Tables 9 and 10).

TABLE 9 Chromosomal location and analysis of the frequency of MSDK tags in the N-MYOEP-4 and D-MYOEP-6 MSDK libraries.

Chr | Virtual Tag | Observed Tag | N-MYOEP-4 Variety | N-MYOEP-4 Copies | D-MYOEP-6 Variety | D-MYOEP-6 Copies | Tag Variety Ratio N-MYOEP-4/D-MYOEP-6 | Tag Copy Ratio N-MYOEP-4/D-MYOEP-6 | Differential Tag (P < 0.05) N-MYOEP-4 > D-MYOEP-6 | Differential Tag (P < 0.05) N-MYOEP-4 < D-MYOEP-6
1 | 551 | 164 | 131 | 833 | 96 | 529 | 1.365 | 1.575 | 4 | 1
2 | 473 | 122 | 97 | 874 | 72 | 524 | 1.347 | 1.668 | 4 | 0
3 | 349 | 96 | 81 | 812 | 62 | 529 | 1.306 | 1.535 | 2 | 0
4 | 281 | 88 | 66 | 464 | 50 | 313 | 1.320 | 1.482 | 3 | 1
5 | 334 | 100 | 81 | 644 | 59 | 362 | 1.373 | 1.779 | 6 | 0
6 | 338 | 88 | 72 | 391 | 49 | 252 | 1.469 | 1.552 | 2 | 1
7 | 403 | 122 | 99 | 651 | 80 | 435 | 1.238 | 1.497 | 2 | 3
8 | 334 | 96 | 80 | 513 | 53 | 302 | 1.509 | 1.699 | 2 | 0
9 | 349 | 103 | 90 | 743 | 60 | 507 | 1.500 | 1.465 | 3 | 1
10 | 387 | 116 | 104 | 573 | 58 | 361 | 1.793 | 1.587 | 2 | 2
11 | 379 | 119 | 96 | 514 | 70 | 330 | 1.371 | 1.558 | 2 | 0
12 | 299 | 98 | 75 | 514 | 63 | 393 | 1.190 | 1.308 | 1 | 1
13 | 138 | 44 | 36 | 208 | 23 | 133 | 1.565 | 1.564 | 4 | 1
14 | 228 | 69 | 55 | 300 | 35 | 198 | 1.571 | 1.515 | 1 | 1
15 | 260 | 90 | 71 | 350 | 49 | 227 | 1.449 | 1.542 | 1 | 1
16 | 340 | 104 | 83 | 506 | 55 | 255 | 1.509 | 1.984 | 4 | 0
17 | 400 | 134 | 99 | 764 | 83 | 589 | 1.193 | 1.297 | 4 | 3
18 | 181 | 44 | 37 | 268 | 26 | 173 | 1.423 | 1.549 | 1 | 1
19 | 463 | 128 | 99 | 609 | 79 | 443 | 1.253 | 1.375 | 3 | 1
20 | 236 | 75 | 63 | 392 | 43 | 246 | 1.465 | 1.593 | 3 | 0
21 | 71 | 20 | 13 | 103 | 12 | 69 | 1.083 | 1.493 | 0 | 1
22 | 217 | 54 | 42 | 291 | 34 | 213 | 1.235 | 1.366 | 1 | 0
X | 185 | 43 | 36 | 201 | 26 | 177 | 1.385 | 1.136 | 0 | 2
Y | 9 | | | | | | | | |
Matches | 7205 | 2117 | 1706 | 11518 | 1237 | 7560 | 1.379 | 1.524 | 55 | 21
No Matches | | 1571 | 793 | 5412 | 1010 | 5831 | 0.785 | 0.928 | 19 | 22
Total | 7205 | 3688 | 2499 | 16930 | 2247 | 13391 | 1.112 | 1.264 | 74 | 43

The column headings are as indicated for Table 1.
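The ratio columns of Table 9 appear to be simple quotients of the N-MYOEP-4 value over the corresponding D-MYOEP-6 value, rounded to three decimal places. The following is a minimal sketch of that arithmetic for a few rows, under that assumption; the names `TABLE9_COUNTS`, `tag_ratios`, and `RATIOS` are illustrative only and are not part of the MSDK method as described.

```python
# Counts copied from Table 9 for a few rows, in the order:
# (N-MYOEP-4 variety, N-MYOEP-4 copies, D-MYOEP-6 variety, D-MYOEP-6 copies)
TABLE9_COUNTS = {
    "1": (131, 833, 96, 529),
    "10": (104, 573, 58, 361),
    "X": (36, 201, 26, 177),
    "Total": (2499, 16930, 2247, 13391),
}


def tag_ratios(n_variety, n_copies, d_variety, d_copies):
    """Return (tag variety ratio, tag copy ratio), N-MYOEP-4 over D-MYOEP-6.

    Assumption: each ratio is a plain quotient rounded to 3 decimals.
    """
    return (round(n_variety / d_variety, 3), round(n_copies / d_copies, 3))


# Ratios recomputed from the counts, keyed by the Chr label used in Table 9.
RATIOS = {chrom: tag_ratios(*counts) for chrom, counts in TABLE9_COUNTS.items()}

if __name__ == "__main__":
    for chrom, (variety_ratio, copy_ratio) in RATIOS.items():
        print(f"Chr {chrom}: variety ratio {variety_ratio:.3f}, "
              f"copy ratio {copy_ratio:.3f}")
```

A ratio above 1 means that more tags were recovered from the normal myoepithelial library than from the DCIS library for that chromosome, which is the direction reported in the text as relative hypermethylation of the DCIS myoepithelial cells.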

TABLE 10 MSDK tags significantly differentially (p < 0.050) present in N-MYOEP-4 and D-MYOEP-6 MSDK libraries and genes associated with the MSDK tags.

MSDK Tag | SEQ ID NO. | N-MYOEP-4 | D-MYOEP-6 | Ratio N/D | P value | Chr | Gene | Description | Position of AscI site in relation to tr. Start | Distance of AscI site from tr. Start (bp)
ATTAACCTTTGAAGCCC | 1137 | 17 | 3 | 4 | 0.009539 | 1 | SHREW1 | transmembrane protein SHREW1 | 3′ | 687
GCCTCTCTGCGCCTGCC | 1138 | 32 | 12 | 2 | 0.04196 | 1 | GFI1 | growth factor independent 1 | 3′ | 4842
CGCAAAAGCGGGCAGCC | 1139 | 9 | 0 | 9 | 0.008683 | 1 | DHX9 | DEAH (Asp-Glu-Ala-His) box polypeptide 9 isoform | 5′ | 139
CGCAAGAGGCGCAGGCA | 1140 | 0 | 5 | −6 | 0.029059 | 1 | WNT3A | wingless-type MMTV integration site family | 5′ | 59111
CGCAAGAGGCGCAGGCA | 1141 | 0 | 5 | −6 | 0.029059 | 1 | WNT9A | wingless-type MMTV integration site family | 5′ | 41
GAGCGGCCGCCCAGAGC | 1142 | 21 | 4 | 4 | 0.004625 | 1 | TAF5L | PCAF associated factor 65 beta | 3′ | 192
CCCCAGCTCGGCGGCGG | 1143 | 144 | 83 | 1 | 0.014399 | 2 | TCF7L1 | HMG-box transcription factor TCF-3 | 3′ | 859
AGAGTGACGTGCTGTGG | 1144 | 7 | 0 | 7 | 0.014679 | 2 | MERTK | c-mer proto-oncogene tyrosine kinase | 3′ | 281
AAATTCCATAGACAACC | 1145 | 16 | 0 | 16 | 0.000509 | 2 | HOXD4 | homeo box D4 | 3′ | 1141
TGTATTGCTTCTTCCCT | 1146 | 9 | 0 | 9 | 0.008683 | 2 | ITM2C | integral membrane protein 2C isoform 1 | 5′ | 36609
GGGCCGAGTCCGGCAGC | 1147 | 26 | 5 | 4 | 0.001331 | 3 | CHST2 | carbohydrate (N-acetylglucosamine-6-O) | 3′ | 61
CTCGGTGGCGGGACCGG | 1148 | 23 | 4 | 5 | 0.002085 | 3 | SCHIP1 | schwannomin interacting protein 1 | 3′ | 490368
GCGGCGCCCTCTGCTGG | 1149 | 6 | 0 | 6 | 0.022859 | 4 | FLJ37478 | hypothetical protein FLJ37478 | 5′ | 50272
GCGGCGCCCTCTGCTGG | 1150 | 6 | 0 | 6 | 0.022859 | 4 | WHSC2 | Wolf-Hirschhorn syndrome candidate 2 protein | 5′ | 565
TGGCCCCCGCTGCCCGC | 1151 | 6 | 0 | 6 | 0.022859 | 4 | FLJ37478 | hypothetical protein FLJ37478 | 5′ | 74
TGGCCCCCGCTGCCCGC | 1152 | 6 | 0 | 6 | 0.022859 | 4 | WHSC2 | Wolf-Hirschhorn syndrome candidate 2 protein | 5′ | 50763
AGCCACCTGCGCCTGGC | 1153 | 7 | 17 | −3 | 0.04018 | 4 | PAQR3 | progestin and adipoQ receptor family member III | 5′ | 101
CTTAGATCTAGCGTTCC | 1154 | 21 | 7 | 2 | 0.03636 | 4 | DKFZP564J102 | DKFZP564J102 protein | 5′ | 4
GGAGGTCTGAGGATGCC | 1155 | 13 | 0 | 13 | 0.006039 | 5 | FLJ20152 | hypothetical protein FLJ20152 | 5′ | 108193
TGACAGGCGTGCGAGCC | 1156 | 28 | 7 | 3 | 0.003434 | 5 | MGC33648 | hypothetical protein MGC33648 | 5′ | 92617
TGACAGGCGTGCGAGCC | 1157 | 28 | 7 | 3 | 0.003434 | 5 | FLJ11795 | hypothetical protein FLJ11795 | 5′ | 699674
CCTACGGCTACGGCCCC | 1158 | 6 | 0 | 6 | 0.022859 | 5 | FOXD1 | forkhead box D1 | 3′ | 1974
CCACTACTTAAGTTTAC | 1159 | 6 | 0 | 6 | 0.022859 | 5 | UNQ9217 | AASA9217 | 3′ | 335
CTGGGTTGCGATTAGCT | 1160 | 23 | 6 | 3 | 0.009778 | 5 | PPIC | peptidylprolyl isomerase C | 5′ | 62181
GTTTCTTCCCGCCCATC | 1161 | 26 | 6 | 3 | 0.003292 | 5 | PHF15 | PHD finger protein 15 | 3′ | 1577
TGGTTTACCTTGGCATA | 252 | 11 | 0 | 11 | 0.002278 | 6 | FOXF2 | forkhead box F2 | 5′ | 6373
CAACCCACGGGCAGGTG | 110 | 0 | 6 | −8 | 0.01482 | 6 | TAGAP | T-cell activation Rho GTPase-activating protein | 5′ | 123822
AAACAGGCGTGCGGGAG | 1164 | 7 | 0 | 7 | 0.014679 | 6 | T | transcription factor T | 3′ | 1509
ACAAAAATGATCGTTCT | 1165 | 3 | 12 | −5 | 0.022893 | 7 | PLEKHA8 | pleckstrin homology domain containing, family A | 3′ | 159
GTCCCCAGCACGCGGTC | 1166 | 21 | 5 | 3 | 0.009372 | 7 | TBX20 | T-box transcription factor TBX20 | 5′ | 607
CACTAGACCTGCCTGAG | 1167 | 18 | 5 | 3 | 0.028555 | 7 | DLX5 | distal-less homeo box 5 | 3′ | 3450
TCTGGGGGCAAATACGT | 1168 | 0 | 7 | −9 | 0.030903 | 7 | CAV1 | caveolin 1 | 3′ | 1501
AGTATCAAAACGGCAGC | 1169 | 0 | 6 | −8 | 0.01482 | 7 | Not Found
CGAGGAAGTGACCCTCG | 1170 | 6 | 0 | 6 | 0.022859 | 8 | CHD7 | chromodomain helicase DNA binding protein 7 | 5′ | 156
CGGCTTCCCAGGCCCAC | 1171 | 19 | 4 | 4 | 0.008734 | 8 | FLJ43860 | FLJ43860 protein | 5′ | 11074
CAGCGCTACGCGCGGGG | 1172 | 6 | 0 | 6 | 0.022859 | 9 | EPB41L4B | erythrocyte membrane protein band 4.1 like 4B | 3′ | 1346
GTGGGGGGCGACCTGTC | 1173 | 21 | 4 | 4 | 0.004625 | 9 | RGS3 | regulator of G-protein signalling 3 isoform 6 | 3′ | 1569
TACGCGGGTGGGGGAGA | 1174 | 3 | 14 | −6 | 0.007269 | 9 | ADAMTS13 | a disintegrin-like and metalloprotease | 3′ | 6658
AGCCCCCCATTGAAAAG | 1175 | 6 | 0 | 6 | 0.022859 | 9 | OLFM1 | olfactomedin related ER localized protein | 3′ | 13681
AAGAGCAAATAAGAGGC | 1176 | 0 | 9 | −11 | 0.013226 | 10 | KIAA0934 | KIAA0934 | 3′ | 138
CTTTTTTTTTCTTTTAA | 1177 | 0 | 7 | −9 | 0.006886 | 10 | MLLT10 | myeloid/lymphoid or mixed-lineage leukemia | 5′ | 6870
CTTTTTTTTTCTTTTAA | 1178 | 0 | 7 | −9 | 0.006886 | 10 | FLJ45187 | FLJ45187 protein | 5′ | 1620
GAAGCGCTGACGCTGTG | 1179 | 10 | 0 | 10 | 0.021759 | 10 | GRID1 | glutamate receptor, ionotropic, delta 1 | 3′ | 1043
GTTACGCGCCTGCCTCC | 1180 | 7 | 0 | 7 | 0.014679 | 10 | GPR123 | G protein-coupled receptor 123 | 3′ | 17484
CCAGCCCGGGCCCGGGG | 1181 | 6 | 0 | 6 | 0.022859 | 11 | FDX1 | ferredoxin 1 precursor | 5′ | 133525
CCAGCCCGGGCCCGGGG | 1182 | 6 | 0 | 6 | 0.022859 | 11 | RDX | radixin | 5′ | 16634
GCTCAGAGGCGCTGGAA | 1183 | 18 | 5 | 3 | 0.028555 | 11 | ZBTB16 | zinc finger and BTB domain containing 16 | 3′ | 913
CCACGTCTTAGCACTCT | 1184 | 9 | 0 | 9 | 0.008683 | 12 | DDX11 | DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11 | 5′ | 277542
CCACGTCTTAGCACTCT | 1185 | 9 | 0 | 9 | 0.008683 | 12 | C1QDC1 | C1q domain containing 1 isoform 2 | 5′ | 41819
AAGGCTGGGAGTTTTCT | 1186 | 6 | 20 | −4 | 0.005935 | 12 | ABCB9 | ATP-binding cassette, sub-family B (MDR/TAP) | 3′ | 517
CAGCATTGTTTTCACCA | 1187 | 0 | 7 | −9 | 0.030903 | 13 | SGCG | gamma sarcoglycan | 5′ | 20979
GGCTTCGGCCCAGGGTG | 1188 | 8 | 0 | 8 | 0.011061 | 13 | PABPC3 | poly(A) binding protein, cytoplasmic 3 | 5′ | 77913
GGCTTCGGCCCAGGGTG | 1189 | 8 | 0 | 8 | 0.011061 | 13 | CENPJ | centromere protein J | 5′ | 95344
CATTCCTTGCGTGGCTC | 1190 | 7 | 0 | 7 | 0.014679 | 13 | CDX2 | caudal type homeo box transcription factor 2 | 3′ | 1338
GTGACCCCCGCCCCTCC | 1191 | 6 | 0 | 6 | 0.022859 | 13 | FOXO1A | forkhead box O1A | 3′ | 37
TTTGCTACGTGTACATC | 1192 | 7 | 0 | 7 | 0.014679 | 13 | RANBP5 | RAN binding protein 5 | 3′ | 23155
GCCACGAGCCCTAGCGG | 1193 | 0 | 6 | −8 | 0.01482 | 14 | FLJ10357 | hypothetical protein FLJ10357 | 5′ | 22
GCCCCACGCCCCCTGGC | 1194 | 29 | 8 | 3 | 0.004647 | 14 | C14orf153 | chromosome 14 open reading frame 153 | 5′ | 681
GCCCCACGCCCCCTGGC | 1195 | 29 | 8 | 3 | 0.004647 | 14 | BAG5 | BCL2-associated athanogene 5 | 5′ | 19
AGAGCTGAGTCTCACCC | 1196 | 5 | 14 | −4 | 0.042959 | 15 | CDAN1 | codanin 1 | 3′ | 359
GAGCTGCCTGCTTCCCC | 1197 | 13 | 3 | 3 | 0.037287 | 15 | SIN3A | transcription co-repressor Sin3A | 5′ | 2969
CAGGACGACTCAAAGGC | 1198 | 6 | 0 | 6 | 0.022859 | 16 | ATP6V0C | ATPase, H+ transporting, lysosomal, V0 subunit | 5′ | 17685
CGATTCGAACCCAGGGG | 1199 | 42 | 13 | 3 | 0.003577 | 16 | IRX6 | iroquois homeobox protein 6 | 5′ | 386305
GTGCAGTCTCGGCCCGG | 1200 | 33 | 2 | 13 | 0.00001 | 16 | FBXL8 | F-box and leucine-rich repeat protein 8 | 3′ | 3905
TTTGCTTAGAGCCCAGC | 1201 | 6 | 0 | 6 | 0.022859 | 16 | SLC7A6 | solute carrier family 7 (cationic amino acid) | 3′ | 74
CCTACCTATCCCTGGAC | 1202 | 21 | 5 | 3 | 0.009372 | 17 | STAT5A | signal transducer and activator of transcription | 3′ | 1085
GCTATGGGTCGGGGGAG | 215 | 0 | 29 | −37 | 0 | 17 | SOST | sclerostin precursor | 3′ | 3140
CTGACGGGCACCGAGCC | 1204 | 6 | 0 | 6 | 0.022859 | 17 | TBX21 | T-box 21 | 3′ | 715
CCCCGTTTTTGTGAGTG | 221 | 10 | 24 | −3 | 0.0135 | 17 | HOXB9 | homeo box B9 | 5′ | 20620
GCCCAAAAGGAGAATGA | 1206 | 5 | 16 | −4 | 0.01586 | 17 | PHOSPHO1 | phosphatase, orphan 1 | 3′ | 5786
GCCCGGCGGGCCTCCGG | 1207 | 6 | 0 | 6 | 0.022859 | 17 | CD300A | leukocyte membrane antigen | 5′ | 12316
CCCCTGCCCTGTCACCC | 226 | 28 | 0 | 28 | 0.000028 | 17 | SLC9A3R1 | solute carrier family 9 (sodium/hydrogen) | 3′ | 11941
GAAAAGTTGAACTCCTG | 1209 | 0 | 6 | −8 | 0.01482 | 18 | C18orf1 | chromosome 18 open reading frame 1 isoform alpha | 3′ | 20803
GTGGAGGGGAGGTACTG | 1210 | 12 | 0 | 12 | 0.008257 | 18 | IER3IP1 | immediate early response 3 interacting protein | 5′ | 70905
CGTGCGCCCGGGCTGGC | 1211 | 7 | 0 | 7 | 0.014679 | 19 | UHRF1 | ubiquitin-like, containing PHD and RING finger | 5′ | 1499
CGTGCGCCCGGGCTGGC | 1212 | 7 | 0 | 7 | 0.014679 | 19 | M6PRBP1 | mannose 6 phosphate receptor binding protein 1 | 5′ | 41638
ATCGTAGCTCGCTGCAG | 1213 | 0 | 5 | −6 | 0.029059 | 19 | FLJ23420 | hypothetical protein FLJ23420 | 5′ | 75
CACGAAGCCGCCGGGCC | 1214 | 6 | 0 | 6 | 0.022859 | 19 | KLF2 | Kruppel-like factor | 3′ | 540
TTCGGCCCCATCCCTCG | 313 | 22 | 0 | 22 | 0.000068 | 19 | CDC42EP5 | CDC42 effector protein 5 | 3′ | 8020
GACAGACCCGGTCCCTG | 1216 | 6 | 0 | 6 | 0.022859 | 20 | RRBP1 | ribosome binding protein 1 | 3′ | 270
TCCAGAGGCCCGAGCTC | 1217 | 24 | 8 | 2 | 0.024137 | 20 | PPP1R3D | protein phosphatase 1, regulatory subunit 3D | 3′ | 627
CTTCGACTCCGGAGGCC | 1218 | 7 | 0 | 7 | 0.014679 | 20 | CDH4 | cadherin 4, type 1 preproprotein | 5′ | 490627
CAATCACGAATTTGTTA | 1219 | 0 | 5 | −6 | 0.029059 | 21 | HMGN1 | high-mobility group nucleosome binding domain 1 | 3′ | 131
CACCGGGCGCAGTAGCG | 1220 | 27 | 9 | 2 | 0.016802 | 22 | Not Found
GGTCTCCTGAGGACCAG | 1221 | 0 | 8
−10 0.021437 23 Not Found
CTCGCATAAAGGCCACC 1222   0  7  −9 0.006886 23 LAMP2 lysosomal-associated membrane protein 2 5′ 16644

The column headings are as in Table 2 except that the MSDK libraries are the N-MYOEP-4 and D-MYOEP-6 MSDK libraries (see Table 3 for details of the tissues from which the libraries were made).

Besides identifying epigenetic differences between normal and tumor tissue, cell type-specific differences in methylation patterns were seen by comparing MSDK libraries generated from normal epithelial and normal myoepithelial cells (Tables 11 and 12). Epithelial and myoepithelial cells are thought to originate from a common bi-potential progenitor cell [Bocker et al. (2002) Lab. Invest. 82:737-746]. The methylation differences observed between these two cell types raise the possibility of their different clonal origin or of epigenetic reprogramming of the cells during lineage-specific differentiation. Indeed, during embryonic development, epigenetic changes are known to occur in a cell lineage-specific manner and to play a role in differentiation [Kremenskoy et al. (2003) Biochem. Biophys. Res. Commun. 311:884-890].

TABLE 11 Chromosomal location analysis of the frequency of MSDK tags in the N-MYOEP-4 and N-EPI-I7 MSDK libraries.

Tag Variety Ratio and Tag Copy Ratio are N-MYOEP-4/N-EPI-I7. The last two columns give the number of differential tags (P < 0.05).

Chr         Virtual  Observed  N-MYOEP-4  N-MYOEP-4  N-EPI-I7  N-EPI-I7  Tag Variety  Tag Copy  N-MYOEP-4 >  N-MYOEP-4 <
            Tags     Tags      Variety    Copies     Variety   Copies    Ratio        Ratio     N-EPI-I7     N-EPI-I7
1           551      163       131        833        98        496       1.337        1.679     4            2
2           473      112       97         874        62        517       1.565        1.691     6            1
3           349      101       81         812        58        535       1.397        1.518     2            1
4           281      80        66         464        42        244       1.571        1.902     1            2
5           334      99        81         644        55        399       1.473        1.614     4            4
6           338      89        72         391        50        245       1.440        1.596     1            1
7           403      116       99         651        61        340       1.623        1.915     5            2
8           334      97        80         513        51        300       1.569        1.710     1            2
9           349      106       90         743        60        405       1.500        1.835     8            0
10          387      121       104        573        59        378       1.763        1.516     2            4
11          379      113       96         514        69        327       1.391        1.572     1            4
12          299      93        75         514        49        331       1.531        1.553     1            0
13          138      38        36         208        20        108       1.800        1.926     1            1
14          228      63        55         300        28        165       1.964        1.818     1            0
15          260      84        71         350        40        158       1.775        2.215     1            0
16          340      103       83         506        55        279       1.509        1.814     1            1
17          400      124       99         764        70        496       1.414        1.540     4            2
18          181      42        37         268        19        125       1.947        2.144     3            1
19          463      130       99         609        83        388       1.193        1.570     4            2
20          236      75        63         392        38        244       1.658        1.607     2            0
21          71       14        13         103        8         69        1.625        1.493     0            0
22          217      49        42         291        31        205       1.355        1.420     0            1
X           185      39        36         201        19        116       1.895        1.733     0            1
Y           9
Matches     7205     2051      1706       11518      1125      6870      1.516        1.677     53           32
No Matches           1532      793        5412       930       4463      0.853        1.213     34           29
Total       7205     3583      2499       16930      2055      11333     1.216        1.494     87           61

The column headings are as indicated for Table 1.
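The two ratio columns of Table 11 can be reproduced from the variety and copy counts. The following is a minimal sketch, assuming that each ratio is the N-MYOEP-4 value divided by the N-EPI-I7 value and rounded to three decimals; the function name `tag_ratios` and the dictionary layout are illustrative and do not come from the source.

```python
def tag_ratios(variety_a, copies_a, variety_b, copies_b):
    """Return (tag variety ratio, tag copy ratio) of library A over library B."""
    return round(variety_a / variety_b, 3), round(copies_a / copies_b, 3)


# (N-MYOEP-4 variety, N-MYOEP-4 copies, N-EPI-I7 variety, N-EPI-I7 copies),
# copied from three rows of Table 11.
table11_rows = {
    "1": (131, 833, 98, 496),
    "2": (97, 874, 62, 517),
    "Matches": (1706, 11518, 1125, 6870),
}

ratios = {chrom: tag_ratios(*counts) for chrom, counts in table11_rows.items()}

for chrom, (variety_ratio, copy_ratio) in ratios.items():
    print(chrom, variety_ratio, copy_ratio)
```

A ratio above 1 means that more tags (or more tag copies) were recovered from the N-MYOEP-4 library than from the N-EPI-I7 library for that chromosome.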

TABLE 12 MSDK tags significantly (p < 0.050) differentially present inN-MYOEP4 and N-EPI-I7 MSDK libraries and genes associated with the MSDKtags. Position of AscI Ratio N- site in Distance of SEQ N- N- MYOEP-relation AscI site ID MYOEP- EPI- 4/N-EPI- to tr. from tr. MSDK Tag NO.4 I7 I7 P value Chr Gene Description Start Start (bp) AGCACCCGCCTGGAACC223   3 13  −6 0.008872  1 PTPRF protein tyrosine 3′ 727 phosphatase,receptor type, F TCCGAACTTCCGGACCC 224  10  0  10 0.004784  1 Not FoundTCTGGGGCCGGGTAGCC 225  36  9   3 0.007572  1 P66beta transcription 5′117605 repressor p66 beta component of GCAGCGGCGCTCCGGGC 226  38  9   30.004154  1 MUC1 mucin 1, 3′ 139119 transmembrane AGCCCTCGGGTGATGAG  29 27  7   3 0.012636  1 LMX1A LIM homeobox 5′ 752 transcription factor 1,alpha ACGTTTTTAACTACACA 228   0 11 −16 0.003192  1 ELK4 ELK4 protein 3′621 isoform a GCCACCCAAGCCCGTCG 229  11  0  11 0.003665  2 RAB10ras-related GTP- 5′ 106 binding protein RAB10 GCCACCCAAGCCCGTCG 230  11 0  11 0.003665  2 KIF3C kinesin family 5′ 51464 member 3CGCAGCATTGCGGCTCCG 231 102  42   2 0.00343  2 SIX2 sine oculis 5′ 160394homeobox homolog 2 CACACAAGGCGCCCGCG 232  17  4   3 0.039281  2 SIX2sine oculis 5′ 160394 homeobox homolog 2 CTGGAGCTCAGCACTGA 233  10  0 10 0.032551  2 Not Found CCCCAGCTCGGCGGCGG 234 144 76   1 0.038423  2TCF7L1 HMG-box 3′ 859 transcription factor TCF-3 CGTGGCCGGTCAGTGCC 235  7  0   7 0.016949  2 ARHGEF4 Rho guanine 3′ 123018 nucleotide exchangefactor 4 isoform GGCGCCAGAGGAAGATC 236   6 16  −4 0.021688  2 SSBautoantigen La 5′ 29950 CGGCGGGGCAGCCGACG 237  19  4   3 0.018727  3CCR4 chemokine (C-C 5′ 133333 motif) receptor 4 CGGCGCGTCCCTGCCGG 238 75 33   2 0.031796  3 DKFZp313 hypothetical 5′ 339665 N0621 proteinDKFZp313N062 1 CACACCCCGCCCCCAGC 239   0 39 −58 0  3 ACTR8 actin-related3′ 338 protein 8 TGCGGCGCGGGGCGGCC 240  11  0  11 0.018565  4 ZFYVE28zinc finger, 3′ 107 FYVE domain containing 28 GTCCGTGGAATAGAAGG 241   0 8 −12 0.002774  4 Not Found TTTCTTTTATGCAGTTC 242   0  8 
−12 0.002774 4 CAMK2D calcium/calmodu- 5′ 26 lin-dependent protein kinase IIATTTAGTTCTTGTTTTG 243   0  5  −7 0.026319  5 NPR3 natriuretic 5′ 304peptide receptor C/guanylate cyclase TGACAGGCGTGCGAGCC 244  28  2   90.000182  5 MGC33648 hypothetical 5′ 92617 protein MGC33648TGACAGGCGTGCGAGCC 245  28  2   9 0.000182  5 FLJ11795 hypothetical 5′699674 protein FLJ11795 ACCCGGGCCGCAGCGGC 246   3 13  −6 0.008872  5EFNA5 ephrin-A5 3′ 1019 CGGCCGCTCAGCAACTT 247   0  8 −12 0.015444  5KCNN2 small 3′ 832 conductance calcium- activated potassiumACACATTTATTTTTCAG 248   5 15  −4 0.01736  5 KIAA1961 KIAA1961 3′ 146protein isoform 1 TCTCTTGGGGAGATGGG 249   7  0   7 0.016949  5 PACAPproapoptotic 5′ 4496 caspase adaptor protein CTGACCGCGCTCGCCCC  91  26 0  26 0.000147  5 PACAP proapoptotic 5′ 4496 caspase adaptor proteinTCCGACAAGAAGCCGCC 251  14  0  14 0.007231  5 MSX2 msh homeo box 3′ 605homolog 2 TGGTTTACCTTGGCATA 252  11  0  11 0.003665  6 FOXF2 forkheadbox F2 5′ 6373 AAGGAGACCGCACAGGG 253   3 10  −5 0.042045  6 HTR1E 5- 5′97 hydroxytrypta- mine (serotonin) receptor 1E AAGGAGACCGCACAGGG 254   310  −5 0.042045  6 SYNCRIP synaptotagmin 5′ 1294285 binding, cytoplasmicRNA GGGGGGGAACCGGACCG 255  15  0  15 0.000992  7 ACTB beta actin 3′ 865GTGCGGCCGCCGCGGCC 256  15  3   3 0.029313  7 C7orf26 chromosome 7 5′ 362open reading frame 26 AACTTGGGGCTGACCGG 257  19  0  19 0.001464  7 AUTS2autism 3′ 1095850 susceptibility candidate 2 CCTTGACTGCCTCCATC 258  22 5   3 0.014564  7 WBSCR17 Williams Beuren 5′ 512 syndrome chromosomeregion 17 TAAAATAAACTCAGGAC 259   0  7 −10 0.030545  7 SEMA3C semaphorin3C 3′ 214 CACTAGACCTGCCTGAG 260  18  3   4 0.009065  7 DLX5 distal-lesshomeo 3′ 3450 box 5 AGTATCAAAACGGCAGC 261   0  5  −7 0.026319  7 NotFound GGGGCCTATTCACAGCC 262   0  8 −12 0.015444  8 TNKS tankyrase, TRF1-5′ 404285 interacting ankyrin-related GGGGCCTATTCACAGCC 263   0  8 −120.015444  8 PPP1R3B protein 5′ 953 phosphatase 1, regulatory (inhibitorCCCATCCCCCACCCGGA 264   0  5  −7 0.026319  8 LOXL2 lysyl 
oxidase-like 3′403 2 AAGTTGGCCAGCTCGGG 265   7  0   7 0.016949  8 SCRIB scribbleisoform 3′ 194 b TCTGTGTGCTGTGTGCG 266  14  2   5 0.017367  9 SMARCA2SWI/SNF-related 3′ 1580 matrix-associated ATCGAGTGCGACGCCTG 267  10  0 10 0.032551  9 PHF2 PHD finger 3′ 686 protein 2 isoform bGGTGGAGGCAGGCGGGG 268   7  0   7 0.016949  9 TXN thioredoxin 3′ 266GTGGGGGGCGACCTGTC 269  21  3   5 0.003859  9 RGS3 regulator of G- 3′1569 protein signalling 3 isoform 6 GCCTTCGACCCCCAGGC 270  16  3   40.020923  9 BTBD14A BTB (POZ) 5′ 98790 domain containing 14ACAGCCAGCTTTCTGCCC 139  66 28   2 0.034004  9 LHX3 LIM homeobox 5′ 146protein 3 isoform b GGGGAAGCTTCGAGCGC 272  20  4   3 0.013339  9 NotFound AGGCAACAGGCAGGAAG 273   7  0   7 0.016949  9 CACNA1B calciumchannel, 3′ 86 voltage- dependent, L type AAAATAGAGGTTCCTCC 274   4 34−13 0 10 PRPF18 PRP18 pre- 5′ 58621 mRNA processing factor 18 homologAAAATAGAGGTTCCTCC 275   4 34 −13 0 10 C10orf30 chromosome 10 5′ 25417open reading frame 30 AATGAACGACCAGACCC 276  15 35  −3 0.000614 10 DDX21DEAD (Asp- 3′ 506 Glu-Ala-Asp) box polypeptide 21 CAACTGGCCCCAACTAG 277  8  0   8 0.012577 10 CDH23 cadherin related 3′ 159 23 isoform 2precursor AGTTAGTTCCCAACTCA 278   0  5  −7 0.026319 10 MLR2ligand-dependent 5′ 84 corepressor AGTTAGTTCCCAACTCA 279   0  5  −70.026319 10 PIK3AP1 phosphoinositide- 5′ 112373 3-kinase adaptor protein1 CCGCGCTGAGGGGGGGC 280  11  0  11 0.018565 10 CTBP2 C-terminal 3′ 1219binding protein 2 isoform 1 GGGCCCCGCCCAGCCAG 281   0 14 −21 0.000103 10C10orf137 erythroid 5′ 556810 differentiation- related factor 1GGGCCCCGCCCAGCCAG 282   0 14 −21 0.000103 10 CTBP2 C-terminal 5′ 2249binding protein 2 isoform 1 TCTAGGACCTCCAGGCC 283  30 53  −3 0.000667 11SLC39A13 solute carrier 5′ 415 family 39 (zinc transporter)TCTAGGACCTCCAGGCC 284  30 53  −3 0.000667 11 SPI1 spleen focus 5′ 29668forming virus (SFFV) proviral TCCAGCCCACCTGACAG 285   0  7 −10 0.03054511 FLJ22794 FLJ22794 5′ 1744 protein GAGCAGCCAGGGCCGGA 286  14  0  140.007231 11 FBXL11 F-box 
and 5′ 454 leucine-rich repeat protein 11AGCCACGCACCCAGACT 287   0  5  −7 0.026319 11 PIG8 translokin 3′ 649AGGGAAGCAGAAAGGCC 288   0  5  −7 0.026319 11 MGC39545 hypothetical 3′1123 protein LOC403312 GCCGCCACTGCCTCAGG 289  23  5   3 0.010564 12 DTX1deltex homolog 1 5′ 312 GTAGGTGGCGGCGAGCG 290  18  0  18 0.001868 13USP12 ubiquitin-specific 3′ 653 protease 12-like 1 GATATCAAGGTCGCAGA 291  2  8  −6 0.049231 13 GTF3A general 3′ 126 transcription factor IIIAGGCCGGTGCCGCCAGTC 292  18  3   4 0.009065 14 EML1 echinoderm 5′ 62907microtubule associated protein like 1 GCCCCGGCCGCCGCGCC 293  20  4   30.013339 15 Not Found GTGCAGTCTCGGCCCGG 294  33  2  11 0.000043 16 FBXL8F-box and 3′ 3905 leucine-rich repeat protein 8 GGGATCCTCTTGCAAAG 295  5 14  −4 0.029708 16 DNCL2B dynein, 5′ 939218 cytoplasmic, lightpolypeptide 2B GGGATCCTCTTGCAAAG 296   5 14  −4 0.029708 16 MAF v-maf 5′1024 musculoaponeur- otic fibrosarcoma oncogene CCGTGTTGTCCTGCCCG 297 21  3   5 0.003859 17 MNT MAX binding 3′ 228 protein CCACACCTCTCTCCAGG298  11  0  11 0.003665 17 SENP3 SUMO1/sentrin/ 5′ 326 SMT3 specificprotease 3 GGCAACCACTCAGGACG 299  17  2   6 0.0053 17 HCMOGT- spermantigen 3′ 69709 1 HCMOGT-1 GCTATGGGTCGGGGGAG 215   0 45 −67 0 17 SOSTsclerostin 3′ 3140 precursor GCCGCTGCGGCTGCAGC 301   0  5  −7 0.02631917 MGC29814 hypothetical 5′ 24968 protein MGC29814 GCCGCTGCGGCTGCAGC 302  0  5  −7 0.026319 17 RNF157 ring finger 5′ 89 protein 157CCCCAGGCCGGGTGTCC 303  33  9   2 0.018119 17 CBX8 chromobox 5′ 16730homolog 8 GCGGGCGCGGCTCTGGG 304  11  0  11 0.003665 18 TUBB6 tubulin,beta 6 5′ 689 CGAGGGATCTAGGTAGC 305   0  5  −7 0.026319 18 FHOD3 forminhomology 5′ 30 2 domain containing 3 GTGGAGGGGAGGTACTG 306  12  0  120.01257 18 IER3IP1 immediate early 5′ 70905 response 3 interactingprotein TGCTTTTCTGCCCCACT 307   7  0   7 0.016949 18 KIAA0427 KIAA04275′ 530689 TGCTTTTCTGCCCCACT 308   7  0   7 0.016949 18 SMAD2 Sma- andMad- 5′ 77514 related protein 2 GATTTGTTGCAGGGTCT 309  14  0  140.007231 19 AMH anti-Mullerian 
hormone 3′ 2281
GGCCCCGCCCACAGCCC 310   7  0   7 0.016949 19 ZNF560 zinc finger protein 560 5′ 18
TAGGTTCTATGCTCAGT 311   0  5  −7 0.026319 19 AKAP8L A kinase (PRKA) anchor protein 8-like 5′ 13794
GTTTATTCCAAACACTG 312   3 10  −5 0.042045 19 GRIN2D N-methyl-D-aspartate receptor subunit 2D 3′ 48538
TTCGGCCCCATCCCTCG 313  22  0  22 0.000508 19 CDC42EP5 CDC42 effector protein 5 3′ 8020
GCTGCGGCCGGCCGGGG 314  11  0  11 0.018565 19 UBE2S ubiquitin carrier protein 5′ 478
CGCTCCCACGTCCGGGA 315  15  3   3 0.029313 20 SNTA1 acidic alpha 1 syntrophin 3′ 288
CTTTCAAACTGGACCCG 316  16  3   4 0.020923 20 Not Found
TTCCAAAAAGGGGCAGG 317   2  9  −7 0.027716 22 XBP1 X-box binding protein 1 5′ 82906
TAGTACTTTCAGGTAGG 318   2  8  −6 0.049231 23 UBE2A ubiquitin-conjugating enzyme E2A isoform 2 3′ 285

The column headings are as in Table 2 except that the MSDK libraries compared are the N-MYOEP-4 and N-EPI-I7 MSDK libraries (see Table 3 for details of the tissues from which these libraries were made).

In addition to pair-wise comparison of MSDK libraries, genome-wide analyses of methylation and gene expression patterns were performed by combining MSDK and SAGE (Serial Analysis of Gene Expression) data for each breast cell type. The AscI cutting frequencies were determined and SAGE tag counts were superimposed (details in Example 1). They were then mapped to the human genome together with all predicted CpG islands and AscI sites. Based on the combined as well as cell-type-specific MSDK and SAGE analysis, it was determined that highly expressed genes are preferentially located in gene dense areas [Caron et al. (2001) Science 291:1289-1292] and that these areas correlate with the locations of the most frequently cut (thus unmethylated) AscI sites. Interestingly, while the ratio of the observed and predicted MSDK tags averaged for all cells tested was nearly equal for most chromosomes, chromosomes X and 17 had a lower and a higher observed/expected tag ratio, respectively, in all samples, suggesting overall hyper- and hypo-methylation in these specific chromosomes (Tables 1, 2, and 4-12).

Example 4 Confirmation of MSDK Results by Sequencing Studies

To confirm the MSDK results, several highly differentially methylated genes from each pair-wise comparison were selected and their methylation was analyzed by performing sequence analysis of bisulfite-treated genomic DNA from the same sample that was used for MSDK and also from additional samples obtained from independent patients. These genes included PRDM14 and ZCCHC14 (hypermethylated in tumor epithelial cells), HOXD4 and SLC9A3R1 (hypermethylated in DCIS myoepithelial cells), LOC389333 (more methylated in myoepithelial than in epithelial cells), CDC42EP5 (hypermethylated in DCIS myoepithelial cells and also different between normal epithelial and myoepithelial cells), and Cxorf12 (hypermethylated in tumor stroma compared to normal) (FIGS. 9-15). Interestingly, PRDM14 and HOXD4 were also differentially methylated between HCT 116 WT and DKO cells (unmethylated in DKO), suggesting their potential involvement in multiple tumor types or their location in a chromosomal area prone to epigenetic modifications. In all these cases, bisulfite sequence analysis confirmed the MSDK results, although the absolute frequency of methylation was somewhat variable among samples.

FIGS. 16A-22B show the nucleotide sequences of the gene regions that were subjected to the above methylation-detecting sequencing analysis.

Example 5 Determination of Frequency and Consistency of Methylation Difference by Quantitative Methylation Specific PCR (qMSP)

To determine how frequently and consistently methylation differences in these selected genes occur, a quantitative methylation specific PCR (qMSP) assay was developed for some of the genes, and their methylation status was analyzed in a larger set of samples and in multiple cell types. This assay depends on the relative ability of two sets of PCR primers, each targeting segments of DNA that include at least one CpG sequence, to anneal to bisulfite-treated DNA and cause the amplification of the sequence that the primers span. One set of primers is designed to anneal to the target sequences efficiently and cause relatively rapid amplification if the target sequences in the DNA are not methylated, and the other set of primers is designed to act similarly if the target sequences in the DNA are methylated.
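The reason two primer sets can discriminate methylation status is that bisulfite treatment converts unmethylated cytosines to uracil (read as T after PCR) while leaving methylated cytosines unchanged, so the two templates differ in sequence at CpG positions. The following is a minimal sketch of that conversion on one strand. The sequence `region` is a made-up example (it happens to contain an AscI site, GGCGCGCC), and the function names are illustrative rather than part of the described assay.

```python
def cpg_positions(seq):
    """Return the indices of every C that is followed by G (CpG sites)."""
    seq = seq.upper()
    return {i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"}


def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Return the strand as read after bisulfite treatment and PCR.

    Unmethylated C is read as T; C at a methylated position stays C.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


# Hypothetical 15-base region, not taken from the source.
region = "GGCGCGCCTACGTCA"

fully_methylated = bisulfite_convert(region, cpg_positions(region))
unmethylated = bisulfite_convert(region)

# Positions where the two converted templates differ; a methylation-specific
# primer and an unmethylated-specific primer differ at these bases.
differing = [i for i, (a, b) in enumerate(zip(fully_methylated, unmethylated)) if a != b]

print(fully_methylated)
print(unmethylated)
print(differing)
```

In this sketch the two converted templates differ only at the CpG cytosines, which is why primers for this kind of assay are placed over segments that include at least one CpG.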

This analysis not only confirmed the original MSDK data and the bisulfite sequencing results, but also revealed the methylation status of each gene in all three cell types in both normal and tumor tissue (FIGS. 23A-E). The frequency of PRDM14 methylation was further analyzed in a panel of normal breast tissue (purified organoids), benign breast tumors (fibroadenomas, fibrocystic dysplasias, and papillomas), and breast carcinomas (FIG. 24). The majority of breast carcinomas demonstrated high methylation of PRDM14, while only one out of 10 normal breast tissue samples and a few benign tumors had low-level methylation. Based on these data, PRDM14 is a candidate biomarker for breast cancer diagnosis since it is methylated in 90% of invasive tumors and only 10% of normal breast tissue.

In addition, an MSP analysis of genomic DNA from a variety of pancreatic, prostate, lung, and breast cancer samples indicated that the PRDM14 gene is hypermethylated in a wide range of cancers (Table 13). Bisulfite-treated DNA from the various cancer and normal tissues was amplified with: (a) a pair of PCR primers that effectively anneals only to methylated target sequences and causes the production of a detectable PCR product; and (b) a pair of primers that effectively anneals only to unmethylated target sequences and causes the production of a detectable PCR product.

TABLE 13 Methylation of the PRDM14 gene in pancreatic, prostatic, lung, and breast cancer.

Tissue    Sample      U    WM   M    Total  U %    M % (M + WM)
Pancreas  N           7    1    1    9      77.8   22.2
          N in CA     2    0    0    2      100.0  0.0
          CA          1    1    5    7      14.3   85.7
Prostate  N           6    0    0    6      100.0  0.0
          N in CA     2    0    2    4      50.0   50.0
          CA          2    1    2    5      40.0   60.0
          Xenograft   0    0    7    7      0.0    100.0
Lung      N           4    0    0    4      100.0  0.0
          N in CA     6    0    6    12     50.0   50.0
          CA          14   3    87   104    13.5   86.5
          Cell lines  0    0    4    4      0.0    100.0
Breast    N           2    1    0    3      66.7   33.3
          N in CA     0    1    0    1      0.0    100.0
          CA          40   7    91   138    29.0   71.0

N, normal tissue from a healthy person (not a cancer patient). N in CA, normal tissue adjacent to cancer tissue. CA, cancer tissue. Xenograft, cancer tissue grown in nude mice. U, PCR product was detectable (on electrophoretic gels) only in PCR with unmethylated target-specific PCR primers. WM (weakly methylated), PCR product was detectable (on electrophoretic gels) in PCR with both methylated and unmethylated target-specific PCR primers, but the methylated primer-specific PCR was weak compared to the other sample. The numbers in the U, WM, M, and Total columns are the numbers of different samples tested.
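The percentage columns of Table 13 follow from the sample counts. The following is a minimal sketch, assuming that "U %" is U/Total, that "M % (M + WM)" is (M + WM)/Total, and that values are rounded to one decimal; the function name `methylation_percentages` is illustrative.

```python
def methylation_percentages(u, wm, m):
    """Return (total, U %, M % counting M + WM) for one row of Table 13."""
    total = u + wm + m
    u_pct = round(100.0 * u / total, 1)
    m_pct = round(100.0 * (m + wm) / total, 1)
    return total, u_pct, m_pct


# (U, WM, M) counts copied from three rows of Table 13.
pancreas_normal = methylation_percentages(7, 1, 1)
lung_cancer = methylation_percentages(14, 3, 87)
breast_cancer = methylation_percentages(40, 7, 91)

print(pancreas_normal, lung_cancer, breast_cancer)
```

Under these assumptions, the computed values match the tabulated percentages for the rows shown.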

Example 6 Analysis of Gene Expression by Quantitative RT-PCR (qRT-PCR)

To further characterize the effect of methylation changes on gene expression, the expression of selected genes in cells purified from normal breast tissue and from in situ and invasive breast carcinomas was analyzed by RT-PCR (FIGS. 25A-D). Of the four genes analyzed for both methylation and gene expression, only one (Cxorf12) had the differentially methylated sites localized in the predicted promoter area, while in the other three genes (PRDM14, HOXD4, and CDC42EP5) the differentially methylated AscI and surrounding CpG sites were located in an intron or distal exon. Consistent with these findings, the relative expression of Cxorf12 was inversely correlated with methylation, while that of the other three genes was positively correlated with methylation. Thus, in all cases there was a strong correlation between differential methylation of the genes and their differential expression, but only methylation in the promoter area was associated with down-regulation of expression; in other regions it correlated with higher mRNA levels. These results are consistent with prior reports indicating that methylation in non-core (i.e., outside of the promoter) regions does not negatively affect transcription [Ushijima (2005) Nat. Rev. Cancer 5:223-231] and that in some cases (e.g., H19/IGF2, an imprinted gene) DNA methylation in an intron leads to increased gene expression [Feinberg et al. (2004) Nat. Rev. Cancer 4:143-153; Bell et al. (2000) Nature 405:482-485]. The imprinting of IGF2 is dependent on CTCF binding to an enhancer-blocking element within the H19 gene, the methylation of which inhibits CTCF binding and leads to loss of imprinting (LOI) [Feinberg et al. (2004) supra; Bell et al. (2000) supra]. Interestingly, the differentially methylated regions identified in the PRDM14 and CDC42EP5 genes (see above) appear to have a CTCF binding site [Bell et al. (2000) supra]. Thus, some of the genes identified herein are potentially subject to imprinting, and the results presented above indicate possible loss of imprinting in a cell type- and tumor stage-specific manner.

In summary, a novel sequence-based method (Methylation Specific Digital Karyotyping; MSDK) for the analysis of genome-wide methylation profiles is provided. MSDK analysis of three cell types (epithelial and myoepithelial cells and stromal fibroblasts) from normal breast tissue and in situ and invasive breast carcinomas revealed that distinct epigenetic changes occur in all three cell types during breast tumorigenesis. Alterations in stromal and myoepithelial cells thus likely play a role in the establishment of the abnormal tumor microenvironment and contribute to tumor progression.

A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.

Example 7 Determination of the Global DNA Methylation of Stem Cells and Their Differentiated Progeny

To determine the global methylation profile of putative normal mammary epithelial stem cells and their differentiated progeny, cells were purified from normal human breast tissue using known cell type-specific cell surface markers (see FIG. 26A). Mammary epithelial stem cells were identified as lineage⁻/CD24^(−/low)/CD44⁺ cells, while differentiated luminal epithelial cells were purified using anti-MUC1 and anti-CD24 antibodies, and myoepithelial cells were isolated using anti-CD10 antibodies. Hereafter, the putative normal mammary epithelial stem cells are referred to as CD44+ cells, the luminal epithelial cells as MUC1+ or CD24+ cells, and myoepithelial cells as CD10+ cells. The purity and differentiation status of the cells were confirmed by analyzing the expression of known differentiated cell (e.g., MUC1, MME) and mammary stem cell (e.g., IGFBP7, LRP1) markers by semi-quantitative RT-PCR (see FIG. 26B). SAGE (Serial Analysis of Gene Expression) libraries were also generated from each cell fraction to analyze their global expression profile. The SAGE data further supported the hypothesis that CD44+ cells represent stem cells while MUC1+, CD24+, and CD10+ cells represent a differentiated lineage of committed cells, since known luminal and myoepithelial lineage-specific markers and stem cell markers were found mutually exclusively in the respective SAGE libraries.

Example 8 Analysis of MSDK Data Obtained from Isolated Stem Cells and Their Differentiated Progeny

MSDK libraries were generated using genomic DNA isolated from CD44+, CD24+, MUC1+, and CD10+ cells purified as described above (see FIGS. 26A and 26B). By comparing the actual number of MSDK tags obtained in each library to the expected or predicted number of MSDK tags, normal mammary epithelial stem cells (CD44+) were found to be hypomethylated compared to luminal epithelial (CD24+ or MUC1+) and myoepithelial (CD10+) cells (see Table 14). Table 15 lists tags statistically significantly (p<0.05) differentially present in the four MSDK libraries.

TABLE 14 Chromosomal location and analysis of the frequency of MSDK tags in Stem and Differentiated Cells.

Chr         Virtual  Observed  CD10     CD10     CD24     CD24     CD44     CD44     MUC1     MUC1
            Tag      Tag       Variety  Copies   Variety  Copies   Variety  Copies   Variety  Copies
1           588      182       134      811      95       363      145      1004     147      854
2           470      135       98       848      75       393      112      1005     107      826
3           354      119       83       760      61       329      103      1007     91       824
4           298      86        63       469      40       181      68       535      65       449
5           352      108       75       702      64       275      89       910      92       719
6           352      101       70       411      43       120      85       543      79       421
7           418      146       100      608      76       261      126      781      128      672
8           343      107       80       474      66       210      89       598      80       437
9           382      131       95       770      80       365      116      980      102      724
10          403      134       92       573      66       282      107      811      106      666
11          392      130       94       526      68       224      106      677      100      550
12          318      98        73       587      51       272      82       822      79       635
13          149      44        32       228      26       97       35       296      39       264
14          242      64        47       368      35       149      50       472      45       345
15          270      82        55       252      43       117      70       340      66       270
16          350      108       69       485      49       179      86       585      78       520
17          421      138       109      795      69       328      117      1043     103      756
18          186      65        46       248      26       111      52       368      53       256
19          483      140       101      561      69       250      113      660      112      598
20          246      69        55       373      39       167      56       434      54       372
21          78       21        18       80       9        24       16       92       18       55
22          232      69        47       371      32       144      56       494      56       387
X           192      52        40       259      27       93       43       372      36       236
Y           12       0         0        0        0        0        0        0        0        0
Mapped      7531     2329      1676     11559    1209     4934     1922     14829    1836     11836
Not Mapped  339      123       86       608      76       458      95       773      100      726
No Match    0        3934      1218     6224     2174     7428     1181     6909     1202     6043
Total       7870     6386      2980     18391    3459     12820    3198     22511    3138     18605

The column headings are as indicated for Table 1, for the indicated purified cell populations, CD10, CD24, CD44, and MUC1.
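One simple way to compare observed with predicted tags is to divide the number of distinct mapped tags seen in each library (the "Variety" entry of the "Mapped" row of Table 14) by the number of mapped virtual tags. The following is a minimal sketch under that assumption. The source does not state the exact statistic that was used, the calculation ignores differences in library size, and the names `mapped_variety` and `observed_fraction` are illustrative.

```python
# Values copied from the "Mapped" row of Table 14.
MAPPED_VIRTUAL_TAGS = 7531
mapped_variety = {"CD10": 1676, "CD24": 1209, "CD44": 1922, "MUC1": 1836}

# Fraction of predicted (virtual) tags that was actually observed per library.
observed_fraction = {
    cell_type: variety / MAPPED_VIRTUAL_TAGS
    for cell_type, variety in mapped_variety.items()
}

# Because a tag is only produced when its AscI site is cut (unmethylated),
# a higher fraction is read here as a less methylated genome.
least_methylated = max(observed_fraction, key=observed_fraction.get)

for cell_type, fraction in sorted(observed_fraction.items()):
    print(cell_type, round(fraction, 3))
print("highest observed/virtual fraction:", least_methylated)
```

By this measure the CD44+ library recovers the largest share of predicted tags, in line with the statement that the CD44+ cells are hypomethylated relative to the CD24+, MUC1+, and CD10+ cells.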

TABLE 15 List of tags statistically significantly (p < 0.05)differentially present in the four Stem and Differentiated Cell MSDKlibraries. SEQ ID Asci MSDK-Tag NO: CD10 CD24 CD44 Muc1 pValue ChrPosition Up-Gene Dn-Gene TAAGGCTAGACAGAAGA 1319  50  83  39   32  4.22E−16 GAAACTCCACAAAAAGA 1320  25  61  31   34  4.11E− 11 GCCTTTCATAGAGCAGG1321  42  88  62   58  4.73E− 11 GGGCCCCGCCCAGCCAG 1322   0   7   0   23 1.06E− 10 126841258 CTBP2 C10orf13 09 7 TTTAGTGCTTCCTTCAG 1323  40  63 34   36  8.56E−  2 192452398 FLJ22833 SDPR 09 TCGCCGGGCGCTTGCCC   90 18   7  66   26  9.55E−  5 134391719 PITX1 PITX1 08 GTCCTTGTTCCCATAGG  97   6   0  35    9  1.21E−  6 1550618 FOXF2 07 AGCCACCACGCCCAGCC 1326  0   8   0    0  1.69E− 07 CCCCTGCCCTGTCACCC  226  30   9   1   25 7.76E− 17 70268314 SLC9A3 07 R1, NAT9 AAAAAAACCCGTTTCCA 1328  17  29  6   19  1.07E− 06 CGCGTCACTAATTAGAT 1329 261 173 384  384  1.58E− 06GGGGCGAAGAAAGCAGA 1330  45  15  83   29  6.56E− X 122819716 BIRC4 STAG206 CCCCCGCGACGCGGCGG   34  28   1  20    7  2.01E−  1 200773326 C1orf15705 GCCCGCCTGAGCAAGGG 1332  92  33 143   83  5.46E−  9 101328287 C9orf125C9orf125 05 TTGCTCAGGCTGGTCTC 1333  98  23  93   69  6.04E− 05GAAAAGTTGAACTCCTG 1334   0   0  14    2  8.81E− 18 13631664 C18orf1C18orf1 05 CCTGTAATCCCAGCTAC 1335   7  25  15   22  0.00014 11, 165171573, 7 93  4, 16, 4149211, 23 17, 1, 220738, 162 20, 4 24677, 8872811, 364157 8, 6737623 CTGACCGCGCTCGCCCC   91  15   2  30    7  0.00015 5 138757992 DNAJC1 59 8 CCCACCAGGCACGTGGC 1337  79  21  98   55 0.00017 22 37564888 NPTXR CBX6 52 TTCTAACCCAATGCAAG 1338   1  10   0   4  0.00017 69 CAACCCACGGGCAGGTG  110   2   1  21    5  0.00017  6159560410 TAGAP 98 TGAAGATATACCCGTTT 1340  14  28  13   20  0.00018 07GCCTGGCTTCCCCCCAG 1341  65  13  46   42  0.00019  5 176814399 PRR7, GRPRR7, D  1 K6 BN1 GCCCGCGGGGCTGTCCC 1342  13   0  25   24  0.00023 1873090569 MBP GALR1 73 GCTATGGGTCGGGGGAG  215  45  13  79   41  0.0002517 39188537 SOST SOST, D 64 USP3 AGCTCTGGCAGTAGTTG 1344  41   6  51   
23 0.00026 14 63874915 ESR2 MTHFD1 67 CACAGCCAGCCTCCCAG  213  27   0  39  30  0.00028 17 32372307 71 AAGCAGTCTTCGAGGGG 1346  89  27 105   60 0.00042  2 96903463 CNNM4 CNNM3 41 TTCTGCTAGACAGAAGA 1347  23  34  21  20  0.00047 64 GGGGATTCTACCCTGGG 1348  27  12  66   41  0.00054 2046877884 PREX1 ARFGEF 16 2 TCGGACGTACATCGTTA 1349 316 282 401  285 0.00060 99 GTGGCTCACATCTGTAC 1350  24   4  46   21  0.00065  4GCTGCCCCAAGTGGTCT  180   1   7  22    9  0.00071 12 47677137 81GCGCTGCCCTATATTGG 1352  11   2  24   24  0.00103 11 33018089 TCP11L1,TCP11L1 04 LOC91614 TGGAGATTTCAATCGCT 1353  18  34  27   22  0.00122 94AAGATCTTGAGCTTGGG 1354  92  26  84   78  0.00126 22, 2 18834687, 2 88 2, 22 0063861, 20 228651 CGGGCCGGGTCGGGCTC 1355   7   0   5   14 0.00141 16 4683601 MGRN1 NUDT16 07 L1, KIAA1 977 TGGCAAACCCATTCTTG 1356 79  20  82   66  0.00152  7 43682173 MRPS24 MRPS24, 45 URG4GTCCGTGGAATAGAAGG 1357   0   4   1   10  0.00156  4 37979694 TBC1D1FLJ1319  6 7 AGTATCAAAACGGCAGC 1358   8   2  20   22  0.00160  7122120649 CADPS2 TAS2R1 76 6 CCACTGCACTCCAGCCT 1359   7  25  16   12 0.00176 15, 2, 43372896, 1 97  3, 6, 7, 12885413, 1 X 72123633, 158701197, 1 27563622, 1 6561976 CCTGACAGGAACCACCC 1360  12   0   8    2 0.00185 58 TGGGAAGGCGTGGGGTG 1361  67  20  66   36  0.00188 49TTCGGCCCCATCCCTCG  313  10   0   1    9  0.00198 19 59668209 23GTGATAAAGGGAATATC 1363  35  34  23   22  0.00203 68 GCCACCGTCCTGCTGAC1364   2  11   3    1  0.00204 56 GAGATGCGCCTACGCCC 1365  28   3  42  24  0.00209 X 17153468 NHS NHS 14 ACCCGCACCATCCCGGG  229  89  46 140  72  0.00217 17 75432403 CBX4 TBC1D1 61 6 CGTGTGAGCTCTCCTGC 1367  85 37 131   76  0.00222  3 185762859 EPHB3 EPHB3  8 AACCCCGAAACTGGAAG 1368 16   1  25   14  0.00224  3 69064539 FAM19A4 AER61 05 GCCTCAGCATCCTCCTC1369  19   7   8    2  0.00224 22 44777822 FLJ10945 FLJ2736  2 5ACCCTGAAAGTCTAGCC 1370   7   2  22    6  0.00245 48 TGGCCTCTGACACCTGC1371   5   1   0   10  0.00256 15, 1 19241095, 1 66  8, 21 4440489, 13999446 TTTGCTTAGAGCCCAGC 1372   7   0  
 9   15  0.00263 16 66856002SLC7A6, L SLC7A6 57 YPLA3 OS TCTTCTATTGCCTGATT 1373  10   1   5    0 0.00287  9 112017089 SUSD1 SUSD1 99 GCTCGCCGAGGAGGGGC 1374  26  12  56  47  0.00304  3 28591784 AZI2 RBMS3 51 TTGCCCAGGCTGGTCCC 1375   0   6  0    1  0.00325 34 ACGGCCACTGAAACGGA 1376  18   1  14   18  0.00328 11198846 RIC8A, BE SIRT3, RI 51 T1L, ODF3 C8A CCTCAGATCAGGATGGG 1377  25  5  33   39  0.00336 X 41058142 DDX3X NYX  9 CGCGCAGCTCGCTGAGG 1378  17  2   4   14  0.00347 20 34924764 C20orf117 C20orf11 25 8GGCGTTAATAGAGAGGC 1379  15   2  25   10  0.00348  9 130564512 ASS PRDM1249 TTGCCCAGGCTGGTCTC 1380   2  14   5    6  0.00348  9 131187973 FAM78APPAPDC 82 3 TTGGCTAGGCTGGTCTC 1381   0   6   0    0  0.00350 81CCGCTGGGAGAGGGTTC 1382  19   9  49   26  0.00355 11 133331480 LOC28317JAM3 68 4 CCGCTTGCCCCGAAACC 1383   0   7   1    3  0.00356  9 109621801PALM2 PALM2- 32 AKAP2 ACCCTGAAAGCCTAGCC  266   6   3  24    9  0.0036821 45176032 ITGB2 C21orf69, 04 C21orf6 7, C21orf 70 CCCTGTCCTAGTAACGC1385  16   1   6    9  0.00379  8 38208799 DDHD2 DDHD2 27TCTCTTGGGGAGATGGG 1386  15   1  10    3  0.00402  5 138757992 PACAP, SDNAJC1 99 LC23A1 8 ACCCTCGCGTGGGCCCC 1387  25   3  35   16  0.00435 1912134824 ZNF625 ZNF136 19 ACACCTGTGTCACCTGG 1388   2   0  10    1 0.00435 15 26015921 OCA2 OCA2 86 CACACACACACCCGGGC 1389   0   3   9   0  0.00442  8 37774040 GPR124 BRF2 52 TATTTGCCAAGTTGTAC  113   4   0 14    6  0.00460  7 26997443 45 TCAAGTGTGAGGGGAAG 1391  28   3  25   13 0.00460 12 117004568 FLJ20674 PBP  8 TGCACGCACACTCTTCC 1392  22   3  16   8  0.00460  4 147216331 LOC15248 LOC152 94 5 485 TCACAAGGACAGATGCC1393   0   0   3    8  0.00468 16 68353990 WWP2, N WWP2 31 OB1PTCGAAGGCGGCCGGAGG 1394   0   0   1    7  0.00494  2 56323579 EFEMP1 VRK294 AAGAAATGCCGTTTCCA 1395   0   6   1    1  0.00539 91 TCACATTTCAGTTTGGG1396  33   7  46   22  0.00563  2 227854436 COL4A4 COL4A4, 95 COL4A3GGGTGCGGAACCCGGCC 1397  35   5  31   20  0.00583 20 26137059 C20orf91FLJ4583 62 2 GCAGAGGGCCTGCCCTT 1398   8   0   1 
   2  0.00583 12111958064 OAS2 DTX1 62 TGGGAAAGGTCTTGTGG 1399  40  12  65   47  0.0059610 102749640 LZTS2, PE LZTS2  9 O1 GGCAGGAAGACGGTGGA 1400   3   0  13   7  0.00602 22 49403345 ARSA ACR 49 ACTGTCAAGGTTTCAGG 1401  11   0  12   4  0.00609  4 185018413 FLJ12716 STOX2 87 CAGCCACACCAGTTGCC 1402   5  1   7   15  0.00612  1, 1 120323448,  2 142699053 GGCTTCACCATTGACTC1403  20   2  23   18  0.00657  6 AAGCAGTCTCCCAGGGG 1404   7   0   0   2  0.00677 10 101079937 HPSE2 CNNM1  5 TGGGACCCCAGCACGAC 1405   2   0  6   10  0.00684 17 GCCCGTTCTCAATGAGC 1406   2   7   0    7  0.0069210, 1 120645025, 78  2, 15, 68533541, 4 15, 1, 3372896, 50  1, 1, 1,365101, 157 22_(—) 811972, 189 random, 557275, 223  2, 626710, 227  3,3, 3, 896663, 222  4, 5, 794, 188246  5, 7, 7, 276, 380694  9 28,1092282 89, 1142489 45, 7080798 0, 37452235, 151074465, 127697694,138662914, 26653797 TATAAAATGTGTAAAGT 1407   6   4   0   10  0.00700 15,1 80434892, 8  5  5, 15, 0584867, 80 15, 1 742379, 808  5, 15, 21379,8097 15_(—) 9445, 82689 random, 354, 428294, 15_(—) 490281, 68 random,5562 15_(—) random CTACTGCACTCCAGCCT 1408   0   0   0    6  0.00741 64CAACCCCAACCGCGTTC 1409  13   5  17   27  0.00763  3 126257049 MUC13SLC12A 09 8 AGCTCATTTACATTTTA 1410   9   0   2    4  0.00768  6 35561523TEAD3 TEAD3 83 TGTCACAGACTCCCAGC 1411  32   8  22   12  0.00769 2115359515 NRIP1 USP25 03 GAAGCTTCGGGGTTCCC 1412   8   0  13   13  0.0077771 GACCCCACAAGGGCTTG 1413  22   6  23    5  0.00811 15 73922730 ODF3L1UBE2Q2 09 TGTGTCCTCGGCCCAGG 1414  16   2  22   10  0.00857  6 90177921RRAGD RRAGD 32 TTCCAGTGGCAAGTTGA 1415  71  25  77   43  0.00877 14104557983 CDCA4 CDCA4 43 CCCAGCAGAGAAGTCTG 1416   4   0   6   11 0.00878 11 129824700 ADAMTS1 ADAMTS 72 5 15 TATGTCAGTGTCTGGGA 1417   0  1   8    1  0.00889 19 35411442 C19orf2 ZNF536  6 GCCTTCGACCCCCAGGC1418   8   2   4   16  0.00890  9 136311861 BTBD14A LHX3 53CCCGCGCTCACTGCCAA 1419   9   1   2   12  0.00951 12 121990010 ARL6IP4,ARL6IP4, 13 FLJ13491, PITPNM ABCB9 2 CCAGGCAGGGGTGGGGG 
1420  18   6  30   9  0.00954 16, 1 32804836, 3 78  6 3685485 ATGAGTCCATTTCCTCG 1421  23  5  40   20  0.00976  7 1479529 MGC1091 LOC401 31 1 296GGGGTAACTCTTGAGTC 1422   1   0   3    8  0.00977  8 145230748 SHARPIN,SHARPIN, 89 CYC1 MAF1, KIAA187 5 AGTGAGCCACCACACCC 1423   1   0   1    7 0.00988 10 116518059 ABLIM1 KIAA160 52 0 GCCAAGCCAAATGAAGG 1424   1   0  1    7  0.00988 10 72642515 UNC5B UNC5B 52 GATTATGAAAGCCCATC 1425  26  5  16   13  0.00993 11 128748605 RICS BARX2 99 ATGATTCCTTGCGATTC 1426  0   5   0    1  0.01006 84 GTAGGGGTAAAAGGAGG 1427   0   5   0    1 0.01006 84 TTGCCCAGGCTGGTCTT 1428   0   5   0    1  0.01006 84TTGGCCAGACTGGTCTG 1429   0   5   0    1  0.01006 84 CCTAACAAGATTGCATA1430  47  12  62   41  0.01025 16 68890570 AARS DDX19B, 73 DDX19- DDX19LTCTGAGGGTCGACCAGC 1431   0   5   0    0  0.01027 6 TCTTCATCCCCAAGCGG1432   0   5   0    0  0.01027 6 GACGAGAGCGCCGCCGC 1433   1   0   7    0 0.01050  2 106269374 UXS1 ST6GAL 13 2 GTGCCGCCGCGGGCGCC 1434   5  15 30   18  0.01051  1 22215644 WNT4 ZBTB40 68 GTGGATAAGTTTTTTGA 1435   0  5   1    0  0.01052 72 AGCCACCTGCGCCTGGC 1436  50  16  37   26 0.01187  4 80217832 PAQR3 GK2 29 CCCCCAAGACACATCAA 1437   7   4  24  10  0.01224 14 95052535 C14orf49 GLRX5 68 ACAAAAATGATCGTTCT 1438  46 10  41   31  0.01228  7 29841681 PLEKHA8, PLEKHA 19 FKBP14 8AGAACGGGAACCGTCCA 1439  39  21  29   52  0.01237 12 56418555 CENTG1CENTG1, 84 TSPAN3 1, CDK4 ACCATAGCAACCCTGCC 1440   2   0   2    8 0.01241 15 65920063 LBXCOR1 PIAS1  4 TGCCCTGGGCTGCCCGC 1441   7   1   4  13  0.01272  7 35070597 TBX20 FLJ2231 45 3 ATGGCCAGGCTGGTTTC 1442   2  5   0    0  0.01312 18 7106956 LAMA1 LAMA1 92 CGCCAGCGCCCGCGACC 1443  2   5   0    0  0.01312 92 GGTTTGCTGAAGTGGGG 1444   9   3  23   10 0.01317  9 137486498 FLJ20433 FLJ2043 29 3 AGCCGCGGGCAGCCGCC 1445   8  0   2    3  0.01341  9 132487454 FLJ46082 BARHL1, 84 DDX31GCGGGCGCGGCTCTGCG 1446   9   0   6    2  0.01348 18 12297562 CIDEA TUBB688 TGGAGCTGGTCGGGGAG 1447  16   4  27   12  0.01404 81 
GCGCCAACCGGGGCTGC1448  12   1  16    6  0.01419  8 145605854 CPSF1 SLC39A 07 4GCCCCTGGGGCTTAACC 1449  21   3  14   12  0.01437 11 69602321 TMEM16ATMEM16  2 A ACCCACCAACACACGCC  679   9   2  19   17  0.01443  5170221996 RANBP17 RANBP1 72 7 GGCCGGTGCCGCCAGTC 1451  19   5  14   27 0.01525 14 99266585 CYP46A1 EML1 51 GCGGGGGCAGCAGACGC 1452  22   4  36  28  0.01536  8 71145343 PRDM14 PRDM14  3 AGGCAGGAGATGGTCTG 1453  22  5  32   12  0.01720  9 130564512 ASS PRDM12 91 AGAGAGAAGTTTCTGAG 1454  1   5   1    0  0.01730  9 TAAAAACTAGACAGAAG 1455   1   5   1    0 0.01730  9 AACTTGGGGCTGACCGG 1456   4   0   2    8  0.01737  7 69604814AUTS2 AUTS2 46 CCACTGCACTCCAGTCT 1457   0   5   1    1  0.01739 56GACAGACCCGGTCCCTG 1458   5   0   0    0  0.01757 20 17610446 RRBP1 RRBP196 AAAAGATGTGGTTTGGC 1459  24   6  38   17  0.01858 47 TGTTGAGAATGGGGTAG1460  14   1  13    7  0.01861  7 121538886 LOC38954 CADPS2 81 9AAGCGGGGAGGCTGAGG 1461   5   1  14   12  0.01884 20 60247223 OSBPL2, FOSBPL2  3 LJ44790 GAAACTGAACAACCTGC 1462  13  19   8   22  0.01921 81TCAGCCCAGCGGTATCC 1463  15   4  32   24  0.01951 20 17610446 RRBP1 RRBP1 4 GCCCTGTGTGTCAGCCT 1464   3   3   4   15  0.01964 16 22733582 HS3ST2HS3ST2 67 GGAACGCCCCACCCCGA 1465  12   1   4    8  0.02017 11 551070C11orf35, RASSF7  4 LRRC56 AACTGGCAGAGCAGCAG 1466   0   1   7    1 0.02022  5 52811829 MOCS2 FST 97 GTTTATTCCAAACACTG 1467  13   1   8  12  0.02035 19 53638755 GRIN2D GRIN2D, 04 GRWD1, KCNJ14CAGCCGAAGTGGCGCTC 1468  8   1   4   12  0.02078 11 270514 NALP6 NALP6, A98 THL1 GGGTAGGCACAGCCGTC 1469   4   0   4    9  0.02123 16 30010789TBX6, PPP YPEL3 63 4C CCTGTAATCCCAGCTGC 1470   1   1   0    6  0.0213266 CGTAGGGCCGTTCACCC 1471   2   4   6   14  0.02217 19 63765961 ZNF42,UB ZNF42  4 E2M, CHM P2A CCTGCGCCGCCGCCCGG 1472   5   1   8   13 0.02247 20 48241223 CEBPB CEBPB 32 CCTGCGCCGGGGGAGGC 1473 118  48 139 113  0.02273  4 3804825 FLJ35424 ADRA2C 99 TACGCGGGTGGGGGAAG 1474  67 27  62   37  0.02290 19 GCCACGAAGAACCGGCT 1475   1   0   1    6 0.02321 
11 69298861 FGF4 FGF4 49 TGAGGTGTCAGTCTGCC 1476   1   8   2   3  0.02323  9 110077301 C9orf152 TXN  4 TCCCCATCGGTGGACCC 1477   0  1   6    0  0.02375 11 33847748 LMO2 LMO2  5 CTGCCCGCCTGCTTTCC 1478  1   0   6    0  0.02419  9 95352998 PTCH LOC375 51 748TGAAACGCTGAAGGGAG 1479   1   0   6    0  0.02419 51 CGATTCCATTAGATGAT1480   1   5   0    2  0.02470 46 CTGGGTTGCGATTAGCT 1481  44  15  29  40  0.02542  5 122462500 PPIC FLJ3609 25 0 AGGTTGTTGTTCTTGCC 1482   0  1   0    5  0.02568 76 CAGCTGCCTGGGGGAGG 1483   0   1   0    5 0.02568  2, 2 87000649, 1 76 06562389 GGAATTATCTCTTCCTT 1484   0   2  6    8  0.02576 15 66133874 PIAS1 PIAS1 67 CTATACTGGCTCGTCCT 1485  18  4   9    5  0.02602  3 10724319 ATP2B2 SLC6A1 43 1 TAACTGTCCTTTCCGTA1486  29  10  49   25  0.02620  8 92066919 EFCBP1 TMEM55 64 AGTCCGCACTACGAATCT 1487   0   0   7    4  0.02626  2 74668534 HTRA2, AAUP1, LO 06 UP1, DQX XL3, HTR 1 A2 ATCTGCCCGCCTCAGCC 1488   1   2   7   0  0.02654 19 60289933 EPS8L1 EPS8L1,  5 PPP1R1 2C AATTTGTTGCAGGGTCT1489  10   1   5    1  0.02694 31 TACCCTAAAACTTAAAG 1490   6  11   2   8  0.02743 12, 2 120525394, 92  2 21544337 AAACGAATTACACGGTG 1491   1  0   0    5  0.02766 21 GCAGCCCCTTGGTGGAG 1492  46  12  50   46 0.02787 16 88518083 TUBB3, M TUBB3 52 C1R CACAGCAGCCCGTCAGG 1493   1  0   4    7  0.02809  9 10603198 PTPRD TYRP1 68 CCAGTGCACTCCAGCCT 1494 11   1   3    6  0.02842  1 39767910 HEYL HEYL 94 TGAGGTGTCAGTGTGCC1495   0   0   1    5  0.02898 63 ACGCCGGGGCCGCTCGC 1496   0   4   0   0  0.02899  4 38487591 FLJ13197 KLF3, FL  3 J13197 AGCCACCCCGCCTGGCC1497   0   4   0    0  0.02899  3 AGCCCTGGGGAAAGGGG 1498   0   4   0   0  0.02899  3 AGTCCTGCACAGAAACT 1499   0   4   0    0  0.02899  3ATGCTCCTAAGCCAAAA 1500   0   4   0    0  0.02899  3 ATTTGAGGGTTTGGGAC1501   0   4   0    0  0.02899  3 CATAACCTAAGGTGAAG 1502   0   4   0   0  0.02899  3 CCCTATGCCTACCCAAG 1503   0   4   0    0  0.02899  3CTCGGAAGGAAGCACCA 1504   0   4   0    0  0.02899  3 CTGGACAGAAGGGACTG1505   0   4   0    0  0.02899  3 
GCCTTTCATAGAGCAGC 1506   0   4   0   0  0.02899  3 GCGAAACCCCTCCCCCC 1507   0   4   0    0  0.02899  3GCTAAACCCTCAACAAG 1508   0   4   0    0  0.02899  3 GGAAACTGAGGCAGAAG1509   0   4   0    0  0.02899  3 GGAGCTGGCAGCAGAGG 1510   0   4   0   0  0.02899  3 GTGGCTTGCGCCTGTAC 1511   0   4   0    0  0.02899  3GTGGTACCACAGATGGG 1512   0   4   0    0  0.02899  3 GTGGTGTGAGCCTGTAA1513   0   4   0    0  0.02899  3 TAAGGCTAGACAGGAGA 1514   0   4   0   0  0.02899  3 TATCTGTAACTTACTAA 1515   0   4   0    0  0.02899  3TGAAGATATACCCGTTC 1516   0   4   0    0  0.02899  3 GCCAGGGCCCAGGGGTC1517   6   2  12    1  0.02914  7, 7 56827509, 6 36 2532332CGAACTTCCCGGTTCCG 1518  45  13  49   28  0.02923 12 127277890 SPRR2GSLC15A 54 4 GTGGCTTGCGCCTGTAG 1519  15   5  15   24  0.02925 14103407981 PPP1R13 C14orf2  7 B CACTCCACGTTTATAGA 1520   1   0   7    7 0.02948  4 146760778 SMAD1 SMAD1 68 AGCAGTGGAAGCTTGAG 1521  11   2   4  13  0.03015  3 148597613 ZIC4 ZIC4 48 GCCTGACCCTTTTCTGC 1522   0   2  6    0  0.03035 17 75366221 ENPP7 CBX2 22 GCCGGGGCGGGCTCCTC 1523   6  1  12    2  0.03055 49 CAGAGGGAATAACCAGT 1524   3   1   5   11 0.03062 19 40183199 GRAMD1 GRAMD 69 A 1A AGCCACTGTGCCCAGCC 1525   3   5  0    1  0.03067 96 AGCCACCACACCTGGCT 1526   1   4   0    0  0.03117 59ATTATAAGTTTCCTGAG 1527   1   4   0    0  0.03117 59 GGCTACAGAGTGAGAGC1528   1   4   0    0  0.03117 59 AGCCATCACGCCCGGCC 1529   0   4   0   1  0.03140 57 CAGCAGTTTCTGAGAAT 1530   0   4   0    1  0.03140 57TACATTTCTATTTGTGG 1531   0   4   0    1  0.03140 57 CAGAATCTTCAAAAAGA1532   0   0   5    0  0.03164 32 TACACCAGCGTGGAGGG 1533   0   0   5   0  0.03164  2 47660006 KCNK12 KCNK12 32 CGGAGCCGCCCCAGGGG 1534   1  0   6    7  0.03265 11 496887 RNH1 RNH1 71 TATCCCAGAACTTAAAG 1535   0  5   1    4  0.03272  6 117609989 RFXDC1 VGLL2 76 TGCAAATTGTGGGGGTG1536  37  13  39   17  0.03295 63 CAGCCGACTCTCTGGCT 1537  44  12  33  34  0.03295  3 2115478 CNTN6 CNTN4 84 GGCACCGTCCTGCTGTC 1538  10   1  4    2  0.03299  5 TGCAAGTGGACATTTGG 1539   5   2 
  0    0  0.03318 88ACAAAGTACCGTGGTTC 1540  16   3  28   23  0.03319 12 121784028 TSP-TSP-NY 11 NY, DENR CCAAATCCTACCCAGCC 1541   0   2   0    5  0.03398 1470178138 MED6 MAP3K9 17 ATGGTGTCGCTGGACAG 1542  11   1   5   10  0.03466 2 218907280 IL8RA ARPC2 32 TTCGGGCCGGGCCGGGA 1325  27  12  47   20 0.03510  1 162057422 LMX1A RXRG 55 ATGTATCTACTCAGCTA  934   0   5   3   1  0.03580 45 TATCAACTTGCAAATTC 1208   0   5   3    1  0.03580 45TCCATAGATTGACAAAG 1327  26   5  31   16  0.03662  6 114288310 MARCKSMARCK 97 S CCAGCGGACTGCGCTGC   35   0   1   2    6  0.03669  5 176169485TSPAN17 UNC5A 66 AGCAACTTTCCTGGGTC   302  25   4  30   27  0.03706 2030259008 PLAGL2, PLAGL2, 64 POFUT1 GGCTCTCTGGATTCCCC   303   6   0   2   1  0.03714  6 19800086 IBRDC2 ID4 74 CAGCAGCAGTGGGGCTG 1331   2   0  6    0  0.03751  3 13566249 FBLN2 FBLN2 65 GGTCCATCTGCAAAGGG  677   4  1  12    3  0.03771 19, 1 43952443, 4 36  9 3975229 AATGAACGACCAGACCC 250  32  17  63   43  0.03801 10 70386398 DDX21, D DDX21 87 DX50TAATCTCCCTAAATACC 1336  23  12  38   42  0.03830  7 75592300 HSPB1 YWHAG05 CTCCGGGTGGGGAGGCC  700   1   0   2    6  0.03873 14 104187893FLJ42486 C14orf15 89 1 AACCCAGGAGGCGGAGC 1163   0   5   2    2  0.04039 8 74877871 UBE2W UBE2W 61 GCGTTTGGGGGTGTCGG 1339   2   0   0    5 0.04077  4 147216331 LOC15248 LOC152 87 5 485 GCGAAACCCCGTCTCTA  481  5   5   1   10  0.04088 12, 1 74400342, 2 21  7, 17, 626651, 526 17, 12441, 34250  9, 4, 8, 652, 717144  9 3, 116851, 9 4781802, 66 71656AAACGAAAGGTTCAAGT 1345  10  21  15   10  0.04095 08 CAGATTCTACAAAAGGA 843   0   4   0    2  0.04134 42 AGCCACTGCACCTGGCC 1351   1   7   1   4  0.04231  1, 1, 2 231516029, 53  0 231648771, 44807423CCGGACGTACATCGTTA 1362   5   0   0    5  0.04306 57 GCAGCGGCGCTCCGGGC1215  19   2  25   20  0.04322  1 151836629 DCST1 ADAM15 48TTTCCAGTGCAATTCCG  707   3   2   9   13  0.04384 02 TTTCTTCTAACAAAGGC 676   0   0   2    5  0.04399  5 65257128 NLN ERBB2IP 43ACCCTCTCACACGCACC 1324   4   0   0    0  0.04440 93 AGGCTGGGGCACAGGAC 926   4   
0   0    0  0.04440 19 51834661 GNG8 MGC154 93 76CCAACGCCTGAAGCTCT 1203   4   0   0    0  0.04440 10 30064273 SVIL SVIL93 TCTCTGTAGCTCACCCG  300   4   0   0    0  0.04440 19 2376268 TMPRSS9TIMM13, 93 TMPRSS 9, LMNB2 TGCAACCACCTGAGGTT 1343   4   0   0    0 0.04440  2, 2_(—) 242462672, 93 random 167214 GAAATGCTAAGGGGTTG  296 10   6  25    9  0.04482  1 9646024 RP13- PIK3CD 12 15M17.2AGCCACTGCGCCCGGCC  544   3   8   5    1  0.04493  7 150438654 SMARCDNYREN1 33 3 8 CCCCGGCAGGCGGCGGC  227  40  13  51   27  0.04507 11124175712 FLJ23342 ROBO3 11 GCCACCGTCCTGCTGTC 1205 128 912 146 1184 0.04545   4   7 91 CAGCCAGCTTTCTGCCC  139  47  20  56   26  0.04559  9136323041 LHX3 QSGN6L 06 1 TTGGCCAGGCTGGTCTC  812  45  51  52   47 0.04610 10, 1 102269169, 99  0, 14, 119125579, 14, 1 104353395,  7, 19,104838293, 19, 1, 2574777, 95  1, 1, 1, 1525, 54391 20, 4, 626, 672837 5, 5, 6, 9576680,  7, 7, 7, 200773326,  8, 8, 239591215,  8 44814870, 3623233, 149 090483, 149 717373,6 89 386, 655378 21, 1042663 33, 42251455, 42603361, 68020728 CCATTGCATTCCATTCC  789   0   0   0    4  0.0465406 CCTGGCTAATTTTTTGT 1078   0   0   0    4  0.04654 06 CCTTTGGGTGGAGCAGT 271   0   0   0    4  0.04654 06 CTACAGGCTGGAGGGCA  937   0   0   0   4  0.04654 19 1464508 THSD6 RKHD1 06 GCCATAACTTTTAAGTC  488   0   0  0    4  0.04654 14 74418552 DLST DLST 06 GGGTGGGGGGTGCAGGC  939   0  0   0    4  0.04654  2 241695521 FLJ22671 MTERFD 06 2GTCTCGCTGGCTTCAGG 1113   0   0   0    4  0.04654 15 91055991 LOC40045CHD2 06 1 GTGACTTTCTTCGGGGG 1366   0   0   0    4  0.04654 10 79066844KCNMA1 KCNMA1 06 TGGGGACCCGAGAAGGG  592   0   0   0    4  0.04654 2236239821 CARD10 CDC42E 06 P1 TTGATTTGTGAATACCC 1002   0   0   0    4 0.04654 06 GCAGGGAAGAGAGGAGC 1129   0   1   5    0  0.04942 12117004568 FLJ20674 PBP 05 ATGCGAGGGGCGCGGTA 1162  37   9  44   32 0.04991  2 37811338 CDC42EP FAM82A 62 3 P value, the significance ofthe difference in the raw abundances of the relevant MSDK tag betweenthe four libraries. 
SEQ ID NO:, refers to the Sequence Identification Number assigned to each MSDK-tag nucleotide sequence.
CD10, CD24, CD44, MUC1, refer to the different cell populations used in the MSDK analysis.
AscI position, refers to the bp position within the corresponding chromosome(s) where the AscI site is located.
Chr, chromosome in which MSDK tag sequence is located.
UpGene, refers to nearest gene 5′ to the AscI site.
DnGene, refers to the nearest gene 3′ to the AscI site.

In addition, CD10+ and MUC1+ cells were also found to be hypomethylated compared to CD24+ cells. This latter observation raised the hypothesis (also suggested by SAGE data on these cells) that CD10+ and MUC1+ cells may represent a mix of terminally differentiated myoepithelial and luminal epithelial cells, respectively, and their lineage committed progenitors, while CD24+ cells are mostly terminally differentiated luminal epithelial cells. To identify loci specifically methylated in stem or differentiated cells of a specific lineage (luminal or myoepithelial), pair-wise as well as combined comparisons of the MSDK libraries were performed. Statistically significant (p<0.05) differences were found in each of these comparisons and led to the identification of tags that were specifically methylated in differentiated (luminal or myoepithelial) cells (see FIG. 26C). Interestingly, many of the genes hypomethylated in CD44+ cells are homeogenes or encode polycomb (chromo domain containing) proteins or proteins involved in pathways known to be important for stem cell function. A detailed summary of these genes is shown in Table 16.

TABLE 16 Selected Differentially Methylated Genes in the CD44+ and CD24+ Libraries

Tag                SEQ ID NO:  CD24  CD44  p value      Ratio  Chr  Gene    Distance  Position  Strand  Function
CACAGCCAGCCTCCCAG  213         0     39    5.47E−07     22     17   LHX1    3696      inside    +       Homeobox gene
TATTTGCCAAGTTGTAC  113         0     14    0.00205972   8      7    HOXA10  −4360     upstream  −       Homeobox gene
TATTTGCCAAGTTGTAC  113         0     14    0.00205972   8      7    HOXA11  627       inside    −       Homeobox gene
ACCCACCAACACACGCC  679         2     19    0.00311433   5      5    TLX3    −446896   upstream  +       Homeobox gene
TCGCCGGGCGCTTGCCC  90          7     66    9.33E−08     5      5    PITX1   6168      inside    −       Homeobox gene
ACAATAGCGCGATCGAG  904         2     14    0.0178476    4      16   IRX3    −644272   upstream  −       Homeobox gene
ACAATAGCGCGATCGAG  904         2     14    0.0178476    4      16   IRX5    −460      upstream  +       Homeobox gene
TTAAGAGGGCCCCGGGG  1384        0     7     0.0241671    4      14   NKX2-8  1823      inside    −       Homeobox gene
GAAGGGAATCACAAAAC  1390        0     7     0.0241671    4      4    PHOX2B  −124519   upstream  −       Homeobox gene
GCTATGGGTCGGGGGAG  215         13    79    2.60E−07     3      17   MEOX1   −94080    upstream  −       Homeobox gene
AGCCCTCGGGTGATGAG  29          5     24    0.0106181    3      1    LMX1A   −747      upstream  −       Homeobox gene
CCCCGTTTTTGTGAGTG  221         6     22    0.0355276    2      17   HOXB9   −20615    upstream  −       Homeobox gene
AGCAGCAGCCCCATCCC  811         19    55    0.0136901    2      10   EMX2    −166366   upstream  +       Homeobox gene
CAGCCAGCTTTCTGCCC  139         20    56    0.0169362    2      9    LHX3    −141      upstream  −       Homeobox gene
CCCCAGGCCGGGTGTCC  303         9     37    0.0070473    2      17   CBX8    −16725    upstream  −       Polycomb protein
ACCCGCACCATCCCGGG  229         46    140   5.96E−06     2      17   CBX4    −4595     upstream  −       Polycomb protein
CACCAAACCTAGAAGGC  591         10    33    0.0383201    2      2    GLI2    −56233    upstream  +       Shh pathway
ACCCTGAAAGCCTAGCC  266         3     24    0.00179963   4      21   ITGB2   −10800    upstream  −       stem cell marker
TGGTTTACCTTGGCATA  252         0     13    0.00977299   7      6    FOXF2   −6378     upstream  +       Development/differentiation
GTCCTTGTTCCCATAGG  97          0     35    2.40E−06     19     6    FOXC1   −5061     upstream  +       Development/differentiation
CCCCCGCGACGCGGCGG  34          0     20    0.000800427  11     1    SOX13   −576      upstream  +       Development/differentiation
TGCTTGGATCGTGGGGA              0     11    0.0187511    6      17   SOX15   −24267    upstream  −       Development/differentiation
CACTCCACGTTTATAGA  1520        0     7     0.0241671    4      4    SMAD1   −783      upstream  +       TGFb signaling
GTTTTGGGGGAATGGCA  1450        2     14    0.0178476    4      6    WISP3   −180585   upstream  +       WNT/APC/BCTN pathway
CACAGCCAGCCTCCCAG  213         44    113   0.00118262   1      2    TCF7L1  854       inside    +       WNT/APC/BCTN pathway

P value, the significance of the difference in the raw abundances of the relevant MSDK tag between the four libraries.
SEQ ID NO:, refers to the Sequence Identification Number assigned to each MSDK-tag nucleotide sequence.
CD24 and CD44, refer to the different cell populations (e.g., stem cell and differentiated cell populations) used in the MSDK analysis.
Chr, chromosome in which MSDK tag sequence is located.
Gene, refers to nearest gene to the AscI site.
Position, refers to the location of the AscI site within the associated gene (i.e., upstream (5′) or inside (within the intronic or exonic portion of the gene)).
Distance, refers to the distance of the AscI site from the start site of transcription for the associated gene.
Function, refers to the putative function associated with each gene located near the respective AscI site.
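
By way of illustration only, a pair-wise comparison of the raw abundance of one MSDK tag in two libraries (for example, the CD24 and CD44 columns of Table 16) can be sketched in Python as follows. The sketch uses a two-sided Fisher's exact test as a stand-in; it is not asserted to be the statistic used to compute the p values reported in the tables. The library totals (20,000 tags each) and the function names are hypothetical and are chosen only to make the example self-contained.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the first cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)  # small tolerance for float ties
    )
    return min(1.0, total)


def compare_tag(count_a, total_a, count_b, total_b):
    """p value for one tag observed count_a times among total_a tags in
    library A and count_b times among total_b tags in library B."""
    return fisher_exact_two_sided(
        count_a, total_a - count_a, count_b, total_b - count_b
    )


# Hypothetical library sizes; the tag counts 0 and 39 mirror the first
# row of Table 16 (CD24 = 0, CD44 = 39).
p_value = compare_tag(0, 20000, 39, 20000)
print(f"p = {p_value:.3g}; differs at p<0.05: {p_value < 0.05}")
```

Because the test depends on the total number of tags sequenced in each library, the p value produced by this sketch will generally not match the tabulated values unless the actual library sizes and the actual statistical method are substituted.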

Example 9 Confirmation of Stem and Differentiated Cell MSDK Results by Bisulfite Sequencing Analysis

To confirm the MSDK results, sets of statistically significantly differentially methylated genes from each comparison were selected, and their methylation status was analyzed by sequencing bisulfite-treated genomic DNA from the same sample that was used for MSDK. These genes included FNDC1 and FOXC1 (hypomethylated in CD44+ cells compared to all others), PACAP (hypomethylated in CD44+ and CD10+ cells compared to others), SLC9A3R1 (hypomethylated in CD24+, MUC1+, and CD10+ cells compared to CD44+), DDN1 (hypomethylated in CD44+ compared to CD10+ cells), and DTX1 and CDC42EP5 (hypomethylated in CD10+ compared to CD44+ cells). In all these cases, bisulfite sequencing analysis confirmed the MSDK results (see FIG. 27A).

Example 10 Determination of the Frequency and Consistency of Methylation Difference Between Stem and Differentiated Cells by qMSP

To determine how consistently the selected genes of FIG. 27A are differentially methylated in stem and differentiated cells from multiple independent women, the quantitative methylation specific PCR (qMSP) assay (described above) was utilized to analyze methylation in a larger set of samples. qMSP confirmed MSDK and bisulfite sequencing data and demonstrated that cell lineage specific methylation is consistent among samples derived from women of different ages (18-58 years old) and reproductive history, although some variability in the degree of methylation was observed (see FIG. 27B).

Example 11 Analysis of Gene Expression of Selected Genes Differentially Methylated in Stem and Differentiated Cells by qRT-PCR

To characterize the effect of methylation changes on gene expression, the expression of the selected genes was analyzed by quantitative RT-PCR in the same cells that were analyzed by qMSP in Example 10. FIG. 28 shows the relative expression of the selected genes differentially methylated in CD44+, CD10+, MUC1+, and CD24+ cell subsets. Overall, an association between the methylation status and expression of the genes was observed. However, methylation did not have the same effect on expression of all the genes. The expression of FNDC1, DDN, LHX1, and HOXA10 was lower in methylated samples, while PACAP and CDC42EP5 were expressed at higher levels in hypermethylated cells. In the case of FOXC1 and SOX13 in the CD44+, MUC1+, and CD24+ samples, there was an inverse association between methylation and gene expression, but FOXC1 was expressed in CD10+ cells despite being methylated and SOX13 was not highly expressed in CD10+ cells despite being hypomethylated. These variations could result if the CD10+ cell fraction is a mix of myoepithelial progenitor and committed myoepithelial cells, and thus, has both progenitor and differentiated cell properties.

Example 12 Correlation of Methylation Status to Clinico-Pathologic Characteristics of Breast Carcinomas

To determine if the methylation of the most highly cell lineage specifically methylated genes would correlate with clinico-pathologic characteristics of breast carcinomas, the methylation of PACAP, FOXC1 (both unmethylated in CD44+ cells compared to MUC1+, CD24+, and CD10+ cells), and SLC9A3R1 (hypermethylated in CD44+ cells compared to all three other cell types) was analyzed in 149 sporadic invasive ductal carcinomas, 11 BRCA1⁺ tumors, 21 BRCA2⁺ tumors, and 14 phyllodes tumors. Based on this analysis, the methylation of PACAP and FOXC1 was found to be statistically significantly associated with hormone receptor (estrogen receptor-ER, progesterone receptor-PR) and HER2 status of the tumors and with tumor subtypes. Basal-like tumors (defined as ER⁻/PR⁻/HER2⁻) and BRCA1 tumors exhibited the same methylation profile as normal CD44+ stem cells, while ER⁺ and HER2⁺ tumors were more similar to differentiated cells. These results supported the hypothesis that either (a) different tumor subtypes have distinct cells of origin or (b) cancer stem cells in different tumors have different differentiation potential.

To evaluate these two hypotheses, qMSP analyses of putative cancer stem cells (lin⁻/CD24^(−/low)/CD44⁺/EPCR⁺) and differentiated (CD24+) cells were performed using genes that were highly cell type specifically methylated in normal breast tissue (see FIG. 29A). This analysis demonstrated that the DNA methylation profiles of tumor stem (CD44+) and CD24+ cells were the same as their corresponding normal counterparts, suggesting that regardless of the tumor subtype, cancer stem cells are likely to be more similar to each other and to normal stem cells than to more differentiated (CD24+) cells from the same tumor.

Example 13 Correlation of Methylation Status to Clinico-Pathologic Characteristics of Breast Carcinomas

Based on the hypothesis that cancer stem cells are responsible for the metastatic spread and recurrence of tumors, the number of cancer stem cells would be expected to be higher in distant metastases compared to primary tumors. To test this hypothesis, the methylation status of four of the most highly cell type specifically methylated genes in primary tumors and matched distant metastases (collected from the same patient) was analyzed. Unexpectedly, the methylation of HOXA10, FOXC1, and LHX1 was higher in distant metastases compared to primary tumors, approaching or even exceeding levels detected in differentiated CD24+ cells, while no clear pattern was observed for PACAP (see FIG. 29B). This suggested that the number of CD24+ cells is increased in the distant metastasis, a finding reinforced by immunohistochemical analyses of these samples using stem and differentiated cell markers. Of the several plausible explanations of these results, the most likely is cell plasticity and different selection conditions in the primary tumor and distant metastases. Indeed, analysis of E-cadherin methylation and expression demonstrated that cell differentiation is a dynamic process and could occur during metastatic progression. Thus, it is possible that the CD44+ cancer stem cells are the ones that metastasize, but that they differentiate at the site of metastasis. Analysis of the genetic composition of CD24+ and CD44+ cells at the single cell level in primary tumors and matched metastases would be necessary to resolve this question.

In summary, the genome-wide DNA methylation profile of human putative mammary epithelial stem cells and differentiated luminal and myoepithelial cells was determined. Genes that were found to be methylated in a cell type specific manner demonstrated that cancer stem and differentiated cells are epigenetically distinct and are more similar to their corresponding normal counterparts than to each other, and the methylation status of selected genes classified breast tumors into cell subtypes.

1. A method of making a methylation specific digital karyotyping (MSDK) library, the method comprising: providing all or part of the genomic DNA of a test cell; exposing the DNA to a methylation-sensitive mapping restriction enzyme (MMRE) to generate a plurality of first fragments; conjugating to one terminus or to both termini of each of the first fragments a binding moiety, the binding moiety comprising a first member of an affinity pair, the conjugating resulting in a plurality of second fragments; exposing the plurality of second fragments to a fragmenting restriction enzyme (FRE) to generate a plurality of third fragments, each third fragment comprising at one terminus the first member of the affinity pair and at the other terminus the 5′ cut sequence of the FRE or the 3′ cut sequence of the FRE; contacting the plurality of third fragments with an insoluble substrate having bound thereto a plurality of second members of the affinity pair, said contacting resulting in a plurality of bound third fragments, each bound third fragment being a third fragment bound via the first and second members of the affinity pair to the insoluble substrate; conjugating to free termini of the bound third fragments a releasing moiety, the releasing moiety comprising a releasing restriction enzyme (RRE) recognition sequence and, 3′ of the recognition sequence of the RRE, either the 5′ cut sequence of the FRE or the 3′ cut sequence of the FRE, the conjugating resulting in a plurality of bound fourth fragments, each bound fourth fragment (i) comprising at one terminus the recognition sequence of the RRE and (ii) being bound via the first member of the affinity pair at the other terminus and the second member of the affinity pair to the insoluble substrate; and exposing the bound fourth fragments to the RRE, the exposing resulting in the release from the insoluble substrate of a MSDK library, the library comprising a plurality of fifth fragments, each fifth fragment comprising the releasing moiety and a MSDK tag, the tag consisting of a plurality of base pairs of the genomic DNA.
2. The method of claim 1, wherein the MMRE is AscI.
3. The method of claim 1, wherein the FRE is NlaIII.
4. The method of claim 1, wherein the RRE is MmeI.
5. The method of claim 1, wherein the binding moiety further comprises a 5′ or 3′ cut sequence of the MMRE.
6. The method of claim 1, wherein the binding moiety further comprises, between the 5′ or 3′ recognition sequence of the MMRE and the first member of an affinity pair, a linker nucleic acid sequence comprising a plurality of base pairs.
7. The method of claim 1, wherein the releasing moiety further comprises, 5′ of the RRE recognition sequence, an extender nucleic acid sequence comprising a plurality of base pairs.
8. A method of analyzing a MSDK library, the method comprising: providing a MSDK library made by the method of claim 1; and identifying the nucleotide sequences of one tag, a plurality of tags, or all of the tags.
9. The method of claim 8, wherein identifying the nucleotide sequences of a plurality of tags comprises: making a plurality of ditags, each ditag comprising two fifth fragments ligated together; forming a concatamer comprising a plurality of ditags or ditag fragments, wherein each ditag fragment comprises two MSDK tags; determining the nucleotide sequence of the concatamer; and deducing, from the nucleotide sequence of the concatamer, the nucleotide sequences of one or more of the MSDK tags that the concatamer comprises.
10. The method of claim 9, wherein the ditag fragments are made by exposing ditags to the FRE.
11. The method of claim 9, further comprising, after making a plurality of ditags and prior to forming the concatamers, increasing the number of ditags by PCR.
12. The method of claim 8, further comprising determining the relative frequency of some or all of the tags.
13. A method of analyzing a MSDK library, the method comprising: providing a MSDK library made by the method of claim 1; and identifying a chromosomal site corresponding to the sequence of a tag selected from the library.
14. The method of claim 9, further comprising determining a chromosomal location, in the genome of the test cell, of an unmethylated full recognition sequence of the MMRE closest to the identified chromosomal site.
15. The method of claim 13, wherein the identification of the chromosomal site and the determination of the chromosomal location is performed by a process comprising comparing the nucleotide sequence of the selected tag to a virtual tag library generated using the nucleotide sequence of the genome or the part of a genome, the nucleotide sequence of the full recognition sequence of the MMRE, the nucleotide sequence of the full recognition sequence of the FRE, and the number of nucleotides separating the full recognition sequence of the RRE from the RRE cutting site.
16. A method of determining the chromosomal location of a plurality of unmethylated recognition sequences of the MMRE, the method comprising repeating the method of claim 14 with a plurality of tags obtained from the library.
17. The method of claim 1, wherein the test cell is a vertebrate cell.
18. The method of claim 1, wherein the test cell is a mammalian test cell.
19. The method of claim 18, wherein the mammalian test cell is a human test cell.
20. The method of claim 18, wherein the test cell is a normal cell.
21. The method of claim 18, wherein the test cell is a cancer cell.
22. The method of claim 21, wherein the cancer cell is a breast cancer cell.
23. The method of claim 1, wherein the first member of the affinity pair is biotin or iminobiotin.
24. The method of claim 1, wherein the first member of the affinity pair is an antigen, a haptenic determinant, a single-stranded nucleotide sequence, a hormone, a ligand for adhesion receptor, a receptor for an adhesion ligand, a ligand for a lectin, a lectin, a molecule containing all or part of an immunoglobulin Fc region, bacterial protein A, or bacterial protein G.
25. The method of claim 1, wherein the insoluble substrate comprises magnetic beads.
26. A method of classifying a biological cell, the method comprising: (a) performing the method of claim 12, thereby obtaining a test MSDK profile for the test cell; (b) comparing the test MSDK profile to separate control MSDK expression profiles for one or more control cell types; (c) selecting a control MSDK profile that most closely resembles the test MSDK profile; and (d) assigning to the test cell a cell type that matches the cell type of the control MSDK profile selected in step (c).
27. The method of claim 26, wherein the test and control cells are vertebrate cells.
28. The method of claim 27, wherein the test and control cells are mammalian cells.
29. The method of claim 28, wherein the test and control cells are human cells.
30. The method of claim 28, wherein the control cell types comprise a control normal cell and a control cancer cell of the same tissue as the normal cell.
31. The method of claim 30, wherein the control normal cell and the control cancer cell are breast cells.
32. The method of claim 30, wherein the control normal cell and the control cancer cell are of a tissue selected from the group consisting of colon, lung, prostate, and pancreas.
33. The method of claim 30, wherein the test cell is a breast cell.
34. The method of claim 30, wherein the test cell is of a tissue selected from the group consisting of colon, lung, prostate, and pancreas.
35. The method of claim 26, wherein the control cell types comprise cells of different categories of a cancer of a single tissue.
36. The method of claim 35, wherein the different categories of a cancer of a single tissue comprise a breast ductal carcinoma in situ (DCIS) cell and an invasive breast cancer cell.
37. The method of claim 35, wherein the different categories of a cancer of a single tissue comprise two or more of: a high grade DCIS cell; an intermediate grade DCIS cell; and a low grade DCIS cell.
38. The method of claim 28, wherein the control cell types comprise two or more of: a lung cancer cell; a breast cancer cell; a colon cancer cell; a prostate cancer cell; and a pancreatic cancer cell.
39. The method of claim 26, wherein the control cell types comprise an epithelial cell obtained from non-cancerous tissue and a myoepithelial cell obtained from non-cancerous tissue.
40. A method of diagnosis, the method comprising: (a) providing a test breast epithelial cell; (b) determining the degree of methylation of one or more C residues in a gene in the test cell, wherein the gene is selected from those identified by the MSDK tags listed in Table 5, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control epithelial cell obtained from non-cancerous breast tissue, wherein an altered degree of methylation of the one or more C residues in the test epithelial cell compared to the control epithelial cell is an indication that the test epithelial cell is a cancer cell.
41-44. (canceled)
 45. The method of claim 40, whereinthe gene is selected from the group consisting of PRDM14 and ZCCHC14.46. A method of diagnosis, the method comprising: (a) providing a testcolon epithelial cell; (b) determining the degree of methylation of oneor more C residues in a gene in the test cell, wherein the gene isselected from those identified by the MSDK tags listed in Table 2,wherein the one or more C residues are C residues in CpG sequences; and(c) comparing the degree of methylation of the one or more residues tothe degree of methylation of corresponding one or more C residues in acorresponding gene in a control epithelial cell obtained fromnon-cancerous colon tissue, wherein an altered degree of methylation ofthe one or more C residues in the test epithelial cell compared to thecontrol epithelial cell is an indication that the test epithelial cellis a cancer cell. 47-50. (canceled)
 51. The method of claim 46, wherein the gene is selected from the group consisting of LHX3, TCF7L1, and LMX-1A.
 52. A method of diagnosis, the method comprising: (a) providing a test myoepithelial cell obtained from a test breast tissue; (b) determining the degree of methylation of one or more C residues in a gene in the test cell, wherein the gene is selected from those identified by the MSDK tags listed in Table 10, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control myoepithelial cell obtained from non-cancerous breast tissue, wherein an altered degree of methylation of the one or more C residues in the test myoepithelial cell compared to the control myoepithelial cell is an indication that the test breast tissue is cancerous tissue.
 53-56. (canceled)
 57. The method of claim 52, wherein the gene is selected from the group consisting of HOXD4, SLC9A3R1, and CDC42EP5.
 58. A method of diagnosis, the method comprising: (a) providing a test fibroblast obtained from a test breast tissue; (b) determining the degree of methylation of one or more C residues in a gene in the test cell, wherein the gene is selected from those identified by the MSDK tags listed in Tables 7 and 8, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control fibroblast obtained from non-cancerous breast tissue, wherein an altered degree of methylation of the one or more C residues in the test fibroblast compared to the control fibroblast is an indication that the test breast tissue is cancerous tissue.
 59-62. (canceled)
 63. The method of claim 58, wherein the gene is Cxorf12.
 64. A method of determining the likelihood of a cell being an epithelial cell or a myoepithelial cell, the method comprising: (a) providing a test cell; (b) determining the degree of methylation of one or more C residues in a gene in the test cell, wherein the gene is selected from those identified by the MSDK tags listed in Table 12, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control myoepithelial cell and to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control epithelial cell, wherein the test cell is: (i) more likely to be a myoepithelial cell if the degree of methylation in the test sample more closely resembles the degree of methylation in the control myoepithelial cell; or (ii) more likely to be an epithelial cell if the degree of methylation in the test sample more closely resembles the degree of methylation in the control epithelial cell.
 65-66. (canceled)
 67. The method of claim 64, wherein the gene is selected from the group consisting of LOC389333 and CDC42EP5.
 68. A method of diagnosis, the method comprising: (a) providing a test cell from a test tissue; (b) determining the degree of methylation of one or more C residues in a PRDM14 gene in the test cell, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in the PRDM14 gene in a control cell obtained from non-cancerous tissue of the same tissue as the test cell, wherein an altered degree of methylation of the one or more C residues in the test cell compared to the control cell is an indication that the test cell is a cancer cell.
 69-74. (canceled)
 75. A method of diagnosis comprising: (a) providing a test sample of breast tissue comprising a test epithelial cell; (b) determining the level of expression in the test epithelial cell of a gene selected from those listed in Table 5, wherein the gene is one that is expressed in a breast cancer epithelial cell at a substantially altered level compared to a normal breast epithelial cell; and (c) classifying the test cell as: (i) a normal breast epithelial cell if the level of expression of the gene in the test cell is not substantially altered compared to a control level of expression for a normal breast epithelial cell; or (ii) a breast cancer epithelial cell if the level of expression of the gene in the test cell is substantially altered compared to a control level of expression for a normal breast epithelial cell.
 76. The method of claim 75, wherein the gene is selected from the group consisting of PRDM14 and ZCCHC14.
 77-78. (canceled)
 79. A method of diagnosis comprising: (a) providing a test sample of colon tissue comprising a test epithelial cell; (b) determining the level of expression in the test epithelial cell of a gene selected from those listed in Table 2, wherein the gene is one that is expressed in a colon cancer epithelial cell at a substantially altered level compared to a normal colon epithelial cell; and (c) classifying the test cell as: (i) a normal colon epithelial cell if the level of expression of the gene in the test cell is not substantially altered compared to a control level of expression for a normal colon epithelial cell; or (ii) a colon cancer epithelial cell if the level of expression of the gene in the test cell is substantially altered compared to a control level of expression for a normal colon epithelial cell.
 80. The method of claim 79, wherein the gene is selected from the group consisting of LHX3, TCF7L1, and LMX-1A.
 81-82. (canceled)
 83. A method of diagnosis comprising: (a) providing a test sample of breast tissue comprising a test stromal cell; (b) determining the level of expression in the stromal cell of a gene selected from those listed in Tables 7, 8, and 10, wherein the gene is one that is expressed in a cell of the same type as the test stromal cell at a substantially altered level when present in breast cancer tissue than when present in normal breast tissue; and (c) classifying the test sample as: (i) normal breast tissue if the level of expression of the gene in the test stromal cell is not substantially altered compared to a control level of expression for a control cell of the same type as the test stromal cell in normal breast tissue; or (ii) breast cancer tissue if the level of expression of the gene in the test stromal cell is substantially altered compared to a control level of expression for a control cell of the same type as the test stromal cell in normal breast tissue.
 84. (canceled)
 85. The method of claim 83, wherein the gene is selected from the group consisting of HOXD4, SLC9A3R1, and CDC42EP5.
 86. (canceled)
 87. The method of claim 83, wherein the gene is Cxorf12.
 88-89. (canceled)
 90. A method of determining the likelihood of a cell being an epithelial cell or a myoepithelial cell, the method comprising: (a) providing a test cell; (b) determining the level of expression in the test sample of a gene selected from the group consisting of those identified by the MSDK tags listed in Table 12; (c) determining whether the level of expression of the selected gene in the test sample more closely resembles the level of expression of the selected gene in (i) a control myoepithelial cell or (ii) a control epithelial cell; and (d) classifying the test cell as: (i) likely to be a myoepithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control myoepithelial cell; or (ii) likely to be an epithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control epithelial cell.
 91. The method of claim 90, wherein the gene is selected from the group consisting of LOC389333 and CDC42EP5.
 92. A method of diagnosis comprising: (a) providing a test cell; (b) determining the level of expression in the test cell of a PRDM14 gene; and (c) classifying the test cell as: (i) a normal cell if the level of expression of the gene in the test cell is not substantially altered compared to a control level of expression for a control normal cell of the same tissue as the test cell; or (ii) a cancer cell if the level of expression of the gene in the test cell is substantially altered compared to a control level of expression for a control normal cell of the same tissue as the test cell.
 93-96. (canceled)
 97. A single stranded nucleic acid probe comprising: (a) the nucleotide sequence of a tag selected from those listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16; or (b) the complement of the nucleotide sequence.
 98. An array comprising a substrate having at least 10 addresses, wherein each address has disposed thereon a capture probe comprising: (a) a nucleic acid sequence consisting of a tag nucleotide sequence selected from those listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16; or (b) the complement of the nucleic acid sequence.
 99. A kit comprising at least 10 probes, each probe comprising: (a) a nucleic acid sequence comprising a tag nucleotide sequence selected from those listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16; or (b) the complement of the nucleic acid sequence.
 100. A kit comprising at least 10 antibodies each of which is specific for a different protein encoded by a gene identified by a tag selected from the group consisting of the tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.
 101. A method of determining the likelihood of a cell being a stem cell, a differentiated luminal epithelial cell or a myoepithelial cell, the method comprising: (a) providing a test cell; (b) determining the degree of methylation of one or more C residues in a gene in the test cell, wherein the gene is selected from those identified by the MSDK tags listed in Table 15 or 16, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control stem cell, to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control differentiated luminal epithelial cell, and to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control myoepithelial cell, wherein the test cell is: (i) more likely to be a stem cell if the degree of methylation in the test cell more closely resembles the degree of methylation in the control stem cell; (ii) more likely to be a differentiated luminal epithelial cell if the degree of methylation in the test cell more closely resembles the degree of methylation in the control differentiated luminal epithelial cell; or (iii) more likely to be a myoepithelial cell if the degree of methylation in the test cell more closely resembles the degree of methylation in the control myoepithelial cell.
 102-103. (canceled)
 104. The method of claim 101, wherein the gene is selected from the group consisting of SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10.
 105. A method of determining the likelihood of a cell being a stem cell, a differentiated luminal epithelial cell, or a myoepithelial cell, the method comprising: (a) providing a test cell; (b) determining the level of expression in the test sample of a gene selected from the group consisting of those identified by the MSDK tags listed in Table 15 or 16; (c) determining whether the level of expression of the selected gene in the test sample more closely resembles the level of expression of the selected gene in (i) a control stem cell, (ii) a control differentiated luminal epithelial cell, or (iii) a control myoepithelial cell; and (d) classifying the test cell as: (i) likely to be a stem cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control stem cell; (ii) likely to be a differentiated luminal epithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control epithelial cell; or (iii) likely to be a myoepithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control myoepithelial cell.
 106-107. (canceled)
 108. The method of claim 105, wherein the gene is selected from the group consisting of SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10.
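
By way of non-limiting illustration only, the "more closely resembles" comparison recited in claims 64, 90, 101, and 105 could be sketched as below. The claims do not specify a similarity measure; the mean absolute difference used here, the function names, and all numeric values are assumptions introduced solely for this example. Each profile is assumed to be a list of methylation fractions (0 to 1) at corresponding CpG sites, though expression levels could be substituted in the same way.

```python
from typing import Dict, Sequence


def mean_abs_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Average absolute difference between two equal-length profiles.

    This metric is an assumption for illustration; the claims do not name one.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def closest_control(test: Sequence[float],
                    controls: Dict[str, Sequence[float]]) -> str:
    """Return the name of the control profile the test profile most resembles."""
    if not controls:
        raise ValueError("at least one control profile is required")
    return min(controls, key=lambda name: mean_abs_difference(test, controls[name]))


# Hypothetical methylation fractions at three corresponding CpG sites.
controls = {
    "stem cell": [0.10, 0.15, 0.05],
    "differentiated luminal epithelial cell": [0.80, 0.75, 0.90],
    "myoepithelial cell": [0.45, 0.50, 0.40],
}
test_profile = [0.78, 0.70, 0.85]

print(closest_control(test_profile, controls))
```

With these made-up values the test profile lies nearest the hypothetical luminal epithelial control. In practice the choice of distance measure, the number of sites, and any threshold for calling a result would need to be established empirically.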